Do Something Well, for Its Own Sake

When it comes to what makes a good engineering culture, one thing that belongs on the list, among many others, is the desire to do something well, for its own sake.

Not for delighting customers, not for showing off, not for making metrics look good, not for bonuses and promotions, not for passing code review, not for meeting a manager's requirements, not for avoiding looking bad. Just for its own sake.

Why? Because in software engineering, there are so many places and so many ways to cut corners without being easily noticed. We can save time by not double-checking the numbers and facts we use to back a decision, by closing a resolved bug without actually seeing it pass under the exact conditions described in the bug report, or by applying a one-time mitigation in the live site just to make the problem go away, without capturing the steps in a troubleshooting guide (let alone automating them).

Those behaviors can do serious hidden damage to the team:

  1. Bad money drives out good. People who cut corners like that get more work done in the same amount of time than those who truly do things well. Over time, the latter are rewarded less and promoted more slowly.

  2. It drags down the whole team's productivity. Often, when one person saves 10 minutes of work (because he is tired in the evening, or doesn't want to miss the shuttle bus, etc.), many other people on the team later have to spend extra minutes each, adding up to far more than 10 minutes. The team ends up losing hours to save one person 10 minutes.

  3. It makes it harder for the team to learn from its mistakes. The damage or cost of that corner-cutting may only start to emerge after a year or two. By the time the damage happens, it is much harder to connect it back to what was done wrong in the first place.

The desire has to come from deep in people's hearts, from each individual's own value system. The benefits of doing things right are either intangible, unquantifiable, and unmeasurable (hence hard to justify and reward), or visible only in the long term. It takes some altruism. In contrast, the benefits of cutting corners are tangible, measurable, immediate, and self-serving, which fits well with the self-interested side of human nature.

How does this reconcile with "Perfect is the enemy of good"? I don't think they contradict each other; they balance each other. "Perfect is the enemy of good" is about optimization and maximizing the overall return. It really bothers me to see people cut corners and lower the bar in the name of "perfect is the enemy of good".

Be a Lifelong Learner

Note: someone asked "what happens to older programmers?" on Quora. I wrote this answer.

My few cents:

1. We all need to acknowledge that age discrimination exists and is bad (more reading: "What it Feels Like to Be Washed Up at 35"). People have acknowledged the gender gap in the tech industry, but not so much age discrimination.

2. I've worked with several older programmers (in individual contributor roles) in recent years. They are good, as good as colleagues of any age. Some work on service products and do on-call (aka pager) duty the same way as the young folks (those in their 20s). I see age just like other attributes such as gender and race: orthogonal to job performance.

3. It is fair to ask, "Why is this person at the same job level as folks who are 20 years younger? Shouldn't he or she have advanced quite a lot in the last 20 years?" To some extent, that is something recruiters and hiring managers should probe, because it speaks to a person's growth trajectory, and past trajectory is a useful reference for evaluating a candidate's future growth potential. When looking for the answer, it's important to use unbiased eyes. There could be many reasons. For example, the candidate may have plateaued at a senior job level, which is fine for some employers. The candidate may have deliberately slowed down his or her career advancement in exchange for other things, like taking care of a sick family member. The candidate may be relatively new to the profession despite being older; that happens, as some people switch professions in their 40s. They deserve a fair chance.

4. Although it's true that there can't be that many management positions, it's understandable and OK for many young programmers to want to become managers someday, just as it's understandable and OK for many kids to want to be President of the United States, although there can't be many President positions (usually only one every 4 years). Over time, among the young programmers who want to be managers, some will become managers, while others will figure out either that management is not the right job for them, or that they are not (yet) the right person for the manager job.

5. Many people believe that older people are not suitable, or at least not competitive, for programming jobs. Their reasons are usually about energy level, physical fitness, the need to spend time on kids, the fast-evolving technology landscape, etc. Except for the last one (the fast-changing technology landscape), all of these reasons are irrelevant or trivial at best. Take truck drivers as an example. Truck driving is a relatively physically demanding job, so it's intuitive to ask "what happens to older truck drivers?" People may think young drivers are more productive because they can go longer between rests, or that young drivers have the advantage of not needing to spend time on kids. This is similar to how people view young programmers vs. older ones.

But according to online data (File:Truck driver workforce by age.PNG), a large portion of truck drivers are older, and that portion has been growing over the last 10 years (see the 55-64 and 65+ age groups):

Truck driver workforce by age

None of the consequences of aging, neither the physical ones (e.g., declining memory) nor the social ones (e.g., time spent on kids), is a problem for programmers in general.

6. Regarding the fast-evolving technology landscape: that may be more a perception or a partial view than the full truth, depending on how you look at it. It's true that we used to have no more than a handful of database choices (Oracle, SQL Server, DB2, MySQL, PostgreSQL, Sybase), while now there are countless choices, most of them NoSQL databases. On the other hand, if you look at The Top 10 Programming Languages, the top ones are all decades old: Java, 20 years; C, 43 years; C++, 32 years; PHP, 20 years. Even C# is already 15 years old. Although every year or two there are new language features and new programming frameworks for these languages, that is just the normal continuous learning seen in many other professions: accountants face tax code and regulation changes every couple of years; for teachers, textbooks keep changing, and in particular there is the new Common Core; etc. In general, to be successful in any profession, one needs to be a lifelong learner.

Monitoring and QA are the Same Thing? (Part 3)

In Part 2, I pointed out that live site monitoring is about answering two questions, “is everything working?” and “is everything healthy?”, and that invariants, logs, and synthetic transactions are the three ways to find the answers.

For those who are building, using, or improving live site monitoring, besides knowing how to find the answers, there are four more aspects to be aware of and consider:

I. Knowing what has changed in production helps the monitoring answer "is everything working?" and "is everything healthy?" more effectively.

Some changes are triggered by humans, for example code changes or configuration changes. Empirical data shows that about 2/3 of live site incidents were triggered by code or configuration changes (including the case of a pre-existing bug that remains dormant until a new version rollout or configuration change triggers or surfaces it). So after rolling out a new version or flipping a configuration setting, we want the answers as soon as possible (it's OK and reasonable to make a quick preliminary assessment first, then take a bit more time for a fuller one). Similarly, when a live site issue is fixed, we want the answer ASAP to confirm the fix. With manually triggered changes, we know the delta (in code or configuration), so we can look first at the areas related to the delta when answering "is everything working?" and "is everything healthy?".

Other changes are not triggered by humans. They happen naturally as time goes by, or under the influence of the external environment. For example: the message queue length grows as incoming traffic picks up; the data store size grows over time; a system slowly leaks resources; passwords expire, certificates expire, customer accounts are deleted by an automatic backend job after 90 days of non-payment; etc. Some such changes build up very quickly, such as a surge in simultaneous connections during the men's ice hockey semifinal between the USA and Canada at the 2014 Winter Olympics. Knowing what is changing, even when no one is touching the system, helps target the monitoring more precisely.

II. Understand the confidence level of the answer.

Anyone can give an answer to "is everything working?" I could answer "yes" out of ignorance ("I haven't heard about any incidents"). That's a legitimate answer, but a very low-confidence one. In live site monitoring, we need higher-confidence answers to "is everything working?" and "is everything healthy?", in order to reduce false alarms (false positives).

The bar for "high confidence" may vary. For example, we may tune the monitoring system to be a bit more conservative at night, so that we don't wake people up too easily or prematurely, and more aggressive (lowering the bar) during working hours or special occasions (e.g., the Winter Olympics).

Time is the key factor for a monitoring system to gain confidence in an answer (either positive or negative). A slow-building issue usually takes hours or days to confirm. To differentiate a sustained jump from a single spike, the system needs to collect data for a bit longer. In live site monitoring, we often do a quick sanity check to give a preliminary answer (low confidence), then spend more time to be sure (higher confidence).

In short, it takes time to gain confidence. That's why, in general, shorter MTTD (mean time to detect) and a low noise ratio pull against each other. That seems obvious here, but in reality many people forget it in day-to-day work, especially in complex contexts. I have seen it as a common pitfall: people produce designs that try to achieve shorter MTTD and a lower noise ratio at the same time, and leadership sometimes challenges the team to improve both. It's not unachievable, but it's harder than most people think.
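
One common way this tradeoff shows up in practice is a "require N consecutive bad samples before alerting" rule: a larger N means fewer false alarms (higher confidence) but a longer time to detect. A minimal sketch, with hypothetical latency numbers and threshold:

```python
def detect(samples, threshold, consecutive_required):
    """Return the index at which an alert would fire, or -1 if never.

    Waiting for more consecutive bad samples raises confidence (a lone
    spike won't fire) but lengthens MTTD -- the tradeoff described above.
    """
    bad_run = 0
    for i, value in enumerate(samples):
        bad_run = bad_run + 1 if value > threshold else 0
        if bad_run >= consecutive_required:
            return i
    return -1

latency_ms = [90, 95, 300, 92, 310, 320, 330, 340]  # threshold: 200 ms
assert detect(latency_ms, 200, 1) == 2   # fast, but fires on the single spike
assert detect(latency_ms, 200, 3) == 6   # slower, but ignores the lone spike
```

The same knob (here `consecutive_required`) is what gets tuned up at night and down during special occasions, per the "bar for high confidence" point above.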

III. Understand the different levels of turnaround time (usually referred to as MTTD, mean time to detect) and what kind of monitoring approach (invariants, logs, or synthetic transactions) we should invest in, either to move to a higher level of responsiveness or to improve within the same level.

The basic level of turnaround time is knowing about an issue after customers have run into it. If we want to shorten the time between the first few customers hitting the issue and us knowing about it, synthetic transactions may not be the best investment; instead, we should rely more on detecting anomalies and outliers in aggregated logs.

It is better still to know about an issue before it affects any customer. That's a much better turnaround time. To get ahead of the customers, we must use synthetic transactions; the other two approaches (invariants and logs) cannot help when there is no customer impact yet. However, as pointed out in Part 2, synthetic transactions become very expensive when used to cover more granular cases. To balance cost and benefit, it is more practical and realistic to invest only in catching major problems ahead of customers, and to let issues in granular cases surface after some customers are affected. In other words, catching all live site issues ahead of customers should not be the North Star.

Some may ask: shouldn't the ideal turnaround time be negative, detecting the issue before it even exists in production? Of course that would be even better. But that is no longer a live site monitoring thing. Preventing issues from reaching the live site is a QA responsibility, a software testing thing.

IV. How will the answers be delivered?

Many people equate this with sending alerts. But sending alerts is just one way to deliver the answers to "is it working?" and "is it healthy?". There are many other delivery mechanisms. Some groups mount a big flat-screen TV on the wall in their hallway, showing a bunch of real-time numbers and charts. When an issue happens, the numbers turn red or start flashing and the bars/lines in the charts shoot up, and people walking by notice. Such a flat-screen TV is also a mechanism for delivering the answer. Sometimes the answer is delivered without being requested, such as when a threshold is breached in the night and the on-call person gets called.

The differences between the delivery mechanisms are:

  1. Is the delivery guaranteed? A flat-screen TV is not guaranteed delivery, since we can't ensure the right people will walk by and notice the red numbers. Emails and text messages are not guaranteed delivery either; people may not be checking email and texts all the time. Calling the on-call person's cell phone is guaranteed delivery: if the on-call person doesn't answer, the call is retried multiple times and falls back to the secondary on-call and the backup on-call, until someone answers the phone.

  2. Is the delivery explicit or implicit? Implicit: no bad news is good news. Explicit: we keep receiving good news ("yes, everything has been working fine in the last 30 minutes"), in order to have peace of mind about the delivery channel (to avoid wondering whether there really is no bad news, or the bad news got lost on its way).

  3. How soon is the answer delivered? It depends on factors including how bad the issue is (severity, impact assessment) and how sure we are about it (confidence level). There is usually a conflict between confidence level and speed: we can send alerts aggressively at the cost of a high noise ratio, or tune the monitoring system to wait until it is very sure about an issue before alerting, at the cost of a longer MTTD (mean time to detect).

  4. Who is the recipient? A phone call is guaranteed delivery, but only one person receives it at a time. Emails can be sent to a group of people of our choice. Flashing screens in the hallway will be seen, but we don't know by whom exactly. We also want the message delivered to the right people: those who need to know and/or can do something about it.

Among all these ways, there is no single best one; there is only the right way for each situation. "Right" means delivering the message to the right audience, with the right balance between shorter delay and higher confidence, and containing the right level of detail and actionable data.
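
The retry-and-fall-back behavior from point 1 above can be sketched as a small escalation loop. This is a simplified illustration, not any particular paging system; the chain names and retry count are hypothetical:

```python
def deliver_alert(message, call, oncall_chain, retries_per_person=3):
    """Guaranteed-delivery sketch: try each person in the on-call chain in
    order (primary, secondary, backup), retrying the phone call until
    someone answers. `call(person, message)` returns True if answered."""
    for person in oncall_chain:
        for _ in range(retries_per_person):
            if call(person, message):
                return person  # delivered: someone acknowledged the alert
    return None                # nobody answered; escalate further

# Stub example: only the secondary on-call picks up the phone.
answered_by = deliver_alert(
    "disk full on node 7",
    lambda person, msg: person == "secondary",
    ["primary", "secondary", "backup"],
)
assert answered_by == "secondary"
```

A flat-screen TV or an email blast has no loop like this, which is exactly why those channels are not guaranteed delivery.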

Summary: this blog series (Part 1, Part 2, Part 3) captures the mental model I use to look at live site monitoring. It helps me better see the context of each monitoring-related topic, where it sits in the bigger picture, and how the topics relate to each other.

//the end

Monitoring and QA are the Same Thing? (Part 2)

In Part 1, I mentioned that there are three monitoring approaches: a) invariants, b) logs, c) synthetic transactions. Here I'd like to explain them.

Before I do, I want to put them in a broader context: what live site monitoring is about. In my opinion, live site monitoring is about answering two questions:

"Is everything working?" & "Is everything healthy?"

There is a subtle but important difference between working and healthy: working = producing expected results (in terms of correctness, latency, ...); healthy = being in a good condition (so as to remain working in the future).

A few real-life examples may help illustrate the difference between working and healthy:

  1. A car is driving (working), but the water temperature is running high (not healthy; it may break down soon);

  2. A team is shipping product on time (working) but people are burned out (not healthy), or, people have good work/life balance (healthy) but are not delivering (not working);

  3. An ATM can still dispense cash (working), but it's low on cash (not healthy);

  4. A Web server is serving requests fine (working), but its CPU is at 70% on average (not healthy); or it's returning “Internal Server Error” pages (not working) though its CPU is below 10% (healthy); or it's running at high CPU (not healthy) and not responding to requests (not working).

The three approaches (invariants, logs, and synthetic transactions) are three different ways to find the answer to the first question, "is everything working?":

a) Invariants. These are the laws of physics: evaluations that should always hold true. When one of them doesn't, something is wrong somewhere. For example:

  • It should always be true that the current balance equals the previous balance, minus all spending since then, plus all income since then. If those numbers don't add up, something went wrong or is missing.

  • If 200 is the hard ceiling on the number of virtual machines each account can have, and a report says an account has more than 200 virtual machines, something must be wrong.
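
In code, an invariant check like the balance example can be as simple as re-evaluating the equation over the raw records. A minimal sketch; the record fields (`type`, `amount`) are hypothetical:

```python
def check_balance_invariant(prev_balance, transactions, current_balance):
    """Invariant: current = previous - all spending + all income.

    Returns False when the numbers don't add up, i.e. something
    is wrong or missing somewhere upstream.
    """
    spending = sum(t["amount"] for t in transactions if t["type"] == "debit")
    income = sum(t["amount"] for t in transactions if t["type"] == "credit")
    return prev_balance - spending + income == current_balance

# Example: previous balance 100, spent 30, earned 10 -> current must be 80.
txns = [{"type": "debit", "amount": 30}, {"type": "credit", "amount": 10}]
assert check_balance_invariant(100, txns, 80)
assert not check_balance_invariant(100, txns, 85)  # would trigger an alarm
```

The appeal of invariants is that a violation needs no baseline or history: a single failed evaluation is already a high-confidence signal.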

b) Logs, in a general sense: not just the trace log of every API call but also, more importantly, the aggregated data. The log of every transaction can be aggregated at various levels along different dimensions (per region, per OS type, per minute/hour/day/week/month/quarter, ...). We can then analyze it to catch anomalies and outliers. For example:

  • The API foo usually has a 0.05% failure rate, but in the last 5 minutes its failure rate was above 1%.

  • In the last several weeks, the mean time to provision a new virtual machine was 4 minutes and the 99th percentile was 15 minutes. But in the last several hours, the mean time to provision an Extra Large Windows Server 2012 R2 virtual machine jumped to more than 20 minutes, although the overall mean remained unchanged.
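
The failure-rate example above can be sketched in a few lines. The baseline rate and "20x over baseline" factor are hypothetical; a real system would derive the baseline from historical aggregates:

```python
def failure_rate(calls):
    """calls: list of (api_name, succeeded) tuples from a recent log window."""
    total = len(calls)
    failures = sum(1 for _, ok in calls if not ok)
    return failures / total if total else 0.0

def is_anomalous(calls, baseline_rate=0.0005, factor=20):
    """Flag when the recent failure rate far exceeds the historical baseline,
    e.g. a usual 0.05% failure rate jumping above 1%."""
    return failure_rate(calls) > baseline_rate * factor

# 5% failures in the recent window vs. a 0.05% baseline -> anomaly.
recent = [("foo", True)] * 95 + [("foo", False)] * 5
assert is_anomalous(recent)
assert not is_anomalous([("foo", True)] * 100)
```

The same comparison generalizes to any aggregated metric (latency percentiles, call volume) sliced along any dimension (region, OS type, VM size).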

The approach of detecting anomalies and outliers based on aggregated logs has some limitations:

  1. It won't help much when there isn't enough data. Low data volume can have various causes: in a new instance that hasn't been opened up to all customers, traffic is low and there isn't much historical data to establish a baseline; some APIs are low-volume by nature, with only a few hundred calls per day; etc.

  2. It may not tell the full story of how the system is perceived from outside. For example, when API calls are blocked or dropped before they reach our service, we have no log of those calls. (Note, however, that a sudden drop in API call volume is an anomaly that can be caught through log analysis.)

c) Synthetic transactions. This is the most straightforward one: if a bank wants to know whether an ATM is working, it can just send someone to withdraw $20 from that ATM.
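
In software, the analogue of "withdraw $20" is running one real end-to-end scenario on a schedule and recording whether it succeeded. A minimal, generic sketch; the actual transaction (an HTTP request, a test VM provisioning, etc.) is whatever the service's real user scenario is:

```python
def probe_once(run_transaction):
    """Run one synthetic transaction end to end; return (is_working, detail).

    run_transaction is a callable that performs the real user scenario,
    e.g. withdraw $20, or provision and delete a test virtual machine.
    """
    try:
        run_transaction()
        return True, "ok"
    except Exception as exc:
        return False, str(exc)

# Stubbed examples: a working ATM and a broken one.
assert probe_once(lambda: None) == (True, "ok")

def broken_atm():
    raise RuntimeError("out of cash")

assert probe_once(broken_atm) == (False, "out of cash")
```

The monitoring system would schedule `probe_once` at some interval and feed the results into the alerting pipeline; the limitations below all stem from this one-scenario-at-a-time nature.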

The synthetic transactions approach has some limitations, too:

  1. Representativeness. Even if the bank employee can withdraw $20 from that ATM, it doesn't guarantee that all customers will be able to.

  2. Cost. If the bank monitors all its ATMs (possibly thousands) this way, it has to send people to withdraw $20 from every ATM every hour. That is a huge labor cost, plus millions of dollars being withdrawn by bank employees that must then be returned to the bank.
  3. Blind spots, as explained in Part 1.

In the case of monitoring ATMs, detecting anomalies/outliers from aggregated logs is much more effective. As long as the log shows that customers have successfully withdrawn money from an ATM in the last couple of hours, we know that ATM is working. On the other hand, for a busy ATM (e.g., on the first floor of a busy mall), if the log shows no money withdrawn between 2pm and 4pm on a Saturday afternoon, the bank has reason to believe the ATM may not be working, and had better send somebody over to take a look.

To be continued ...

Monitoring and QA are the Same Thing? (Part 1)

Many people think monitoring and QA are the same thing. They suggest that ultimately we should be able to run the same QA test automation in production, as a monitoring approach to verify whether everything is working. As someone put it,

“your monitoring is doing comprehensive semantics checking of your entire range of services and data, at which point it's indistinguishable from automated QA.”

I kind of disagree with that. For two reasons:

Reason 1. There are three monitoring approaches: 1) invariants, 2) logs, 3) synthetic transactions. By suggesting that monitoring and QA test automation are the same thing, people are equating monitoring with synthetic transactions. However, the synthetic transaction approach alone is not sufficient for live site monitoring. Synthetic transactions have some limitations:

  1. Representativeness. The fact that test automation can provision a virtual machine in production doesn't guarantee that all users can do so too.
  2. Cost. Every time the test automation provisions a new virtual machine, there is a cost on the system across all layers. As we increase the coverage of the monitoring and reduce the MTTD (mean time to detect), that overhead becomes significant. In other words, using QA test automation for monitoring is not very economical. Worse, the cost increases exponentially, and the marginal return on investment drops, as we use synthetic transactions to cover more granular scenarios. In the past, I have seen multiple groups go down this slippery slope:
    • In the beginning, the team's live site monitoring only provisioned one Small Windows Server 2008 virtual machine in West US every 10 minutes.
    • Later, a live site bug affected only virtual machines running Windows Server 2012. So instead of covering only Windows Server 2008, the team changed to cover all OS types, including a few mainstream Linux distros. That increased the number of synthetic virtual machines from one per 10 minutes to half a dozen per 10 minutes.
    • Another live site bug caused virtual machine provisioning to fail in some regions but not others. Unfortunately, the synthetic transactions didn't catch it because they only created virtual machines in West US. The team then changed to cover all regions rather than just West US, increasing the number of synthetic virtual machines by another 10x.
    • The team was challenged to reduce the MTTD. They figured the 10-minute interval was too long and reduced it to every 5 minutes, doubling the number of synthetic virtual machines.
    • The team also ran into a live site issue that only affected the A8 and A9 sizes. And so on. Within a year or two, the number of synthetic virtual machines grew from 1 to hundreds.
  3. Blind spots. Synthetic transactions verify the scenarios you know about and catch issues in those scenarios. But there are other scenarios (often corner cases) that you don't know about, or don't expect to behave differently. Synthetic transactions won't cover those, and those are the places where some nasty live site incidents happen.

I believe that in live site monitoring, the other two approaches (logs and invariants) are needed to compensate for these limitations of synthetic transactions. All three approaches are useful, and we need to choose one, or a combination of two or three, appropriately for different purposes.

Reason 2. There are some important differences between the synthetic transactions used in monitoring and the test cases in QA test automation. By suggesting that monitoring and QA test automation are the same thing, people are equating synthetic transactions with QA test automation. That's not the case. Leaving aside the fact that not all test cases can run in production, not all test cases need to run in production. A couple of examples:

  • In QA test automation, a test case will verify that provisioning a new virtual machine fails when the requested size is not in the supported sizes list (e.g., the output of the ListSizes API call). Once our code has passed this test case in the in-house test pass, we trust it to behave the same way in production.
  • In QA test automation, a test case will verify that a customer can create a new virtual machine in the new A10 size if and only if the customer has enrolled in the "LargeVirtualMachine" beta feature. Once our code has passed this test case in the in-house test pass, we trust that in production it will also correctly honor the beta feature enrollment status.

In both examples, we don’t need to run these test cases in production as a part of the live site monitoring.

The chart below illustrates the two reasons above (the scale is not proportional):

monitoring != QA test automation

To be continued ...

Make Sense of the Test Results

Test automation per se is pretty common today. Few software companies still hire an army of manual testers (whether regular employees or contractors).

However, the test results coming out of test automation are usually still pretty raw. A typical test report today tells you that X test cases failed and Y passed. For each failed test case, the report shows a brief error message (e.g., “AssertionFailed: Timed out waiting for status to be [Started]”) with a hyperlink to the full detailed test log. That's pretty much it. Beyond that point, from what I can see, people in many organizations spend lots of time making sense of the test results.

They want to figure out things like:

  • Among the failed test cases, do the failures all share the same cause, or is there a different cause for each?

  • Which failures are new, vs. chronic failures or flaky tests?

  • For a new failure, can I quickly narrow it down to one or two suspicious recent check-ins?

  • For a chronic failure, is this instance the same kind as in the past, or the same test case failing for a different cause?

  • Is the failure already being tracked by a bug? Is someone already working on it?

  • Is the failure supposed to have been fixed? If so, the tracking bug should be reactivated, since the fix didn't seem to work.

  • Is the failure unique to my branch, or happening across the board? If it fails in other branches at the same time, it's unlikely to be caused by changes in my branch and more likely an environment issue.

Besides understanding the failures, the engineers also care about the quality of the test automation:

  • Is the test pass taking longer to finish than before? If so, why? Is it because a) we have more test cases, b) the system under test is now slower, c) a dependency of the system under test is now slower, or d) we are reaching the capacity limit of the test environment so things are getting queued or throttled?

  • How repeatable is the test automation? Which tests are the flakiest, and why?

Without good tools, the above analyses are laborious in many places today. There is no reason we can't have machines perform them and put all the answers right in front of us soon after a test pass finishes. That shouldn't be too hard. Ideally, the machine would just tell us whether the build is good to ship. It's much like visiting a hospital for a general check-up: I don't want to just get a pile of papers full of numbers and charts, because I don't know how to interpret them. Is 42 bad or good for an HDL cholesterol level? If bad, how bad? I have premature contractions; what does that mean? In the end, I just want to be told, “You are doing fine; just lose some weight.”
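
As one small example of such tooling, the "same cause or different causes?" question can often be answered automatically by grouping failures under a normalized error signature. A minimal sketch; the normalization rules here are hypothetical and would be tuned per test system:

```python
import re
from collections import defaultdict

def signature(error_message):
    """Normalize an error message into a stable signature by stripping
    volatile details such as long hex/GUID-like tokens and numbers."""
    sig = re.sub(r"[0-9a-fA-F-]{8,}", "<id>", error_message)
    sig = re.sub(r"\d+", "<n>", sig)
    return sig

def group_failures(failures):
    """failures: list of (test_name, error_message) pairs.
    Returns a mapping of signature -> list of affected test names."""
    groups = defaultdict(list)
    for test, msg in failures:
        groups[signature(msg)].append(test)
    return groups

fails = [("t1", "Timed out after 30s"),
         ("t2", "Timed out after 45s"),
         ("t3", "Connection refused to 10.0.0.7")]
assert len(group_failures(fails)) == 2  # two distinct causes, not three
```

The same signatures can then be matched against past runs and tracking bugs to answer the "new vs. chronic" and "already tracked?" questions above.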


p.s. This reminds me of a TED talk that I watched recently. The speaker, Kenneth Cukier, said:

Big data is going to steal our jobs. Big data and algorithms are going to challenge white collar, professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue collar labor in the 20th century. Think about a lab technician who is looking through a microscope at a cancer biopsy and determining whether it's cancerous or not. The person went to university. The person buys property. He or she votes. He or she is a stakeholder in society. And that person's job, as well as an entire fleet of professionals like that person, is going to find that their jobs are radically changed or actually completely eliminated.

Well, long before machines can tell cancer cells from healthy cells and eliminate the lab technician's job, we should be able to make machines help us make sense of test results.

My Favorite Coding Questions

I was going through the latest posts on Hacker News and Quora tonight and noticed a few about coding questions. I really don't like some of those questions, the kind that ask you to "design an algorithm to ...", although they are among many people's favorites. For example:

Given a string s1 and a string s2, write a snippet to determine whether s2 is
a rotation of s1, using only one call to the strstr routine. (e.g., given
s1 = ABCD and s2 = CDAB, return true; given s1 = ABCD and s2 = ACBD, return false)

I know the answer. I heard this problem from my wife when she was preparing for her last job change. I thought hard about it for several minutes and had no clue; then my wife told me the answer. I was left wondering what the point is of asking such questions in interviews to hire programmers. It doesn't tell me much about the person's methodology for exploring possible solutions. It doesn't tell me much about whether the person is a good problem solver; a puzzle solver != a problem solver. And if the person happens to already know the answer (as I now do), the question becomes worthless.
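
For reference, the well-known trick is to search for s2 inside s1 concatenated with itself. A minimal sketch in Python, where the `in` operator plays the role of the single strstr call:

```python
def is_rotation(s1, s2):
    """s2 is a rotation of s1 iff they have the same length and
    s2 appears in s1 + s1 (one substring search, i.e. one strstr call)."""
    return len(s1) == len(s2) and s2 in s1 + s1

assert is_rotation("ABCD", "CDAB")       # true per the puzzle statement
assert not is_rotation("ABCD", "ACBD")   # false per the puzzle statement
```

Elegant, but exactly the kind of flash of insight you either have in the moment or you don't, which is the point being made here.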

When I conduct coding interviews for developers, whether college candidates or industry candidates with several years of experience, my favorite questions are those with very straightforward algorithms. For example:

1. Search a number in a rotated sorted array.
2. Partially reverse a linked list.

These kinds of questions help me find out whether the person can translate ideas into code quickly and correctly, which is the most common task in our daily work as developers. Most of the time, we already know how to solve the problem on paper; the remaining work is turning it into code so that computers can execute it (to solve the problem for real). It's just like restaurants: most cooks' daily job is to translate recipes into dishes, quickly and correctly. Only occasionally do they need to come up with new recipes.
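
To illustrate the first question: the idea (binary search, plus noticing which half is sorted) is straightforward to state, and the interview tests whether the candidate can turn it into correct code. A sketch of one common solution:

```python
def search_rotated(nums, target):
    """Binary search in a sorted array that has been rotated,
    e.g. [4, 5, 6, 7, 0, 1, 2]. Returns the index of target, or -1.

    At each step, one of the two halves is guaranteed to be sorted;
    check whether the target lies in that sorted half and recurse
    (iteratively) into the appropriate side.
    """
    lo, hi = 0, len(nums) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if nums[mid] == target:
            return mid
        if nums[lo] <= nums[mid]:              # left half [lo..mid] is sorted
            if nums[lo] <= target < nums[mid]:
                hi = mid - 1
            else:
                lo = mid + 1
        else:                                  # right half [mid..hi] is sorted
            if nums[mid] < target <= nums[hi]:
                lo = mid + 1
            else:
                hi = mid - 1
    return -1

assert search_rotated([4, 5, 6, 7, 0, 1, 2], 0) == 4
assert search_rotated([4, 5, 6, 7, 0, 1, 2], 3) == -1
```

The algorithm is simple to describe in a sentence; getting the boundary conditions right in code is exactly the translate-ideas-into-code skill the question probes.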