Zhengziying's Blog: March 2016 Archives

Overall speaking, code review is useful.

From time to time, my reviewers did catch some bugs that I overlooked, or point out options that fell into my blind spot. Sometimes I learn new things in code reviews. For example, last year a reviewer suggested me use ThreadLocal<T>, which indeed would simplify my code a lot. Some code reviews aren't the code review per se, when it's more about a FYI: "Hi, I have written this code and I like you to be aware of and get familiar with it." That happens when you are the only person in the team truly understand that piece of legacy code. Although such FYI type of code review isn't that much helpful to the author, it's still good to the team.

But sometimes code review can become annoying, especially when people spend time on things that (in my opinion) don't really matter. For examples:

string.Join() or String.Join()?
Put using directives inside or outside the namespace block?
readonly static vs. const
System.InvalidOperationException vs. just InvalidOperationException
Optional parameter vs. overloading
User var or explicit declaration?

I understand there are differences, often very subtle or trivial though, between string vs. String, readonly static vs. const, etc.. But those differences don't do any real harm. Explicit declaration, such as Stopwatch stopwatch = Stopwatch.StartNew(), doesn't make the code harder to read^[1] than var stopwatch = Stopwatch.StartNew(). String.Join doesn't make the code slower than using string.Join. Putting using outside of the namespace block doesn't make the code harder to work on. In addition, by default all versions of Visual Studio put using outside of namespace in the code it generates.

I really don't like to spend time on those things in code reviews. I don't think they matter to product quality and engineers' productivity. There are so much more things that we wish we had more time to spend on to improve quality and productivity. Debating on those things are like debating whether to put one space after a period or two.

What people should do is to make sure that their team has collectively decided on what StyleCop^[2] rules to turn on in their code base and get included in the build. Once that's decided and has taken effect, there will be no debate any more: if the rules are violated, it will be a build error and we don't submit code review until it at least passes build. Simple and clear.

[1] Readability is both objective and subjective. There is no doubt about that a line longer than 120 or 150 characters is hard to read and single letter variable names are hard to read. But whether Stopwatch stopwatch = Stopwatch.StartNew() is harder to read than var stopwatch = Stopwatch.StartNew(), that's really personal.
[2] Or the equivalent of StyleCop in other languages.

In one of the teams that I have worked in, our code was full of retries. To name a few:

When the code was looking for a storage account, it would retry up to 30 times when the storage account didn't exist. The whole retries took 13 minutes.
When the code was looking for a setting value which didn't exist, it would also retry 30 times which took 13 minutes;
When the code was trying to read an XML file from a file share, it would retry up to 30 times for 13 minutes when the XML file didn't exist.

That was too much. When a system has too many retries for too many times all over the place, the worst consequence isn't that it wastes time. The worst consequence is that it erodes the quality of the system in an irreversible way. I have seen some groups where people were used to simply adding more retries and retry for more times when they run into intermittent issues. Lots of genuine code bugs are intermittent in its nature. Adding retries will cover these bugs up. But over the time, as the code gets reused, as the system scales, the only way to maintain the same level of availability is to add more and more retries. It's like only reinforcing the siding of the house when its frame is being eaten by termites. Adding retry is simpler and less costly than getting to the bottom of intermittent issues. It takes some courageousness to not go down the easier (but irreversible and poisonous) path.

My rule of thumbs for retry is that we should not retry, or only retry for very few times, in the following situations:

AuthN/AuthZ failure
HTTP 404 or similar error code indicating resource not found
HTTP 400 or similar error code indicating bad input
Timeout

In most of the time, if a resource is not found, it's not found. Retry won't help. The resource is not going to show up all of a sudden several minutes later^[1]. In general I also avoid retry on timeout error. Timeout indicates something has become very slow. Maybe the database is too busy. Maybe the frontend is under high load. Retry on timeout will likely make things worse.

The situations where I think it's necessary and useful to do some retries are:

Network glitches, such as “System.ServiceModel.CommunicationException: The socket connection was aborted”
HTTP 500 or similar error code that indicates a server side error
HTTP 503 or similar error code that certain systems use in throttling. Sometime the response will include some guidance for the client: whether the client may retry, how much time should the client backoff for, ...

It worth pointing out that retry should be mainly used to smooth out glitches^[2], rather than for handling outage. If a service/resource has problems for more than a few seconds, it's an outage. When there is an outage, the caller should treat it like an outage: mark the operation as failure, write necessary logs, tell its caller that things have failed. Usually I frown when I see a code spend more than a few seconds to retry.

Another often debated topic about retry is where shall we put the retry. My take is: either as close as possible to where the failure happens, or as close as possible to the originator of the bigger operation. The benefit of putting retries close to where the failure happens is that you won't have to worry about idempotency and side-effect too much when your code retry. But you have to be very careful with the footprint of the retry. Excessive retries in the lower level code can be amplified when it's placed in the whole system.

Last but not least: do not have a default retry count and interval in your code base. In the examples provided above, they all retried 30 times for 13 minutes. That's because they all use the internal Retry lib with default values for retry count (which was 30) and interval length. To make it worse, it spreads when people copy and paste code (they do!). Not having a default retry count and interval length will enforce the developers to make conscious choice every time about how many times they want the code to retry.

[1] Retry is different than polling. We use polling when we expect to see changes. It's OK to spend several minutes or even longer to wait for things to happen.
[2] People also use other terms like "hiccups", "blips", "transient error".

Zhengziying's Blog

Monthly Archive — March 2016

When Code Review Becomes Annoying

Too Many Retries Erode the Quality