The Combined Engineering in Azure: A Year Later

Last year in Windows Azure^[1], we merged dev and test^[2] and switched to the combined engineering model^[3].

Recently I have been asked quite a few times about my view of that change. My answer was: it solved a few chronic problems in the traditional dev+test model. It solved these problems fairly easily and naturally. If we didn't do the combined engineering change, these problems would still be here today:

1. Quality is everyone's responsibility

We always said: quality is owned by everybody, not just the test team. In the reality, there were always some gaps, more or less. Some developers still had the mentality of "the test team would/should find the bug for me". Now there is no test team. Software engineers can count on nobody but themselves.

2. Improve testability

Although nobody disagreed with the importance of testability design, often times testability is treated as relatively lower priority by the developers in the traditional dev+test model. When they were under the time pressure, they naturally get the feature implemented first and it took long time for some testability requirements getting honored. The worse was that the developers didn't have the sense of testability in their mind when they design and write code. Quite some testability issues were found in pretty late stage when it's too costly/risky to change the design and code.

Now writing test code is a part of the software engineer's job. They have much strong incentive to improve testability because it will make their own work easier. Plus, they truly learn the lessons of poor testability designs because it hurts themselves.

No more begging to the developers to add an API for my test automation to poll to replace a hard-coded Sleep(10000).

3. Push tests to the left

I had hard time to convince some developers to write more unit tests. This is a true story: a dev in my team wrote a custom lock. I found that there was little unit test of that lock. I asked the dev. He told me he think the scenario tests^[4] has already covered it pretty well. I didn't know what to say. Yes we had code coverage data for unit test. But the hall of shame can only go this far.

Now developers (software engineers) own all the tests. Now they have all the incentives to push the tests to the left^[5]: put as much as tests in unit test, because it's fast, easy to debug and nearly free of noises. The integration test is obviously a less favorable place to put the test: it's slow, more hassle to debug and more noisy.

4. Hiring and retention

That was really, really, really a challenge all the time. Most college graduates prefer SDE than SDET^[6]. Partly because they had little exposure to what the SDET job is about. Partly because they are concerned with the "test" tag. Valid concern. Among the industry candidates, many of those who came from software testing background usually didn't meet our requirement of coding and problem skills, because in many places outside Microsoft, test engineers were mainly doing what the STE^[7] used to do in Microsoft. We ended up having to put a lot of effort in convincing developers from other companies to join Microsoft as SDET, which wasn't an easy sell.

Now, voila, problem solved. There is no more "test" tag. Everyone is "Software Engineer". No more SDET wants to switch to SDE to get rid of the "test" tag, because there is no more SDET.

5. Planning and resourcing

We used to do our planning based on dev estimate only. It was understandable. It's much messier to juggle if every work item has two prices (dev estimate and test estimate). In planning, we assume that for every work item, the test estimate is proportional to the dev estimate (e.g. 1:2, which came from our total test:dev ratio) and we believe the variances in each individual work item will average out. It worked OK most of the time. But there were several times where such model cause significantly under-funded test resources and caused crunch in late stage in the project.

Now when engineering managers and software engineers provide work estimate, the price tag has already included both dev estimate and test estimate. Nobody would underestimate the test cost because otherwise they would have to pay for it anyway.

To summarize, that's the power of the roles and responsibility model. In the past, I was the cook at our home and my wife usually do the cleanup. She always complained that I made the stove and counter-top very messy. Later we made a change: I do both cooking and cleanup (and she took some other housework from me). Then all of sudden I paid a lot of attention to not make kitchen messy because otherwise it would be myself that spend time to clean it up.

p.s. Of course there is also the downside of this change. That would be another topic. But the net is a big plus.

[1] I know I should have called it "Microsoft Azure" rather "Windows Azure". It's just the old habit. For us who joined Azure in its early years, we still call it Windows Azure.
[2] Before the merge, we had dev team and test team. Take myself as an example. I was the test manager leading the test team, partnering with the dev manager who led the dev team. My test team was about half of the size of the dev team. In the shift to combined engineering model, we simply merged and became one engineering team of about 70+ people.
[3] Strictly speaking, our shift to the combined engineering did not only include merging the dev and test, but also redefined the role of PM, which now lean toward the market, customer and competition more than the internal engineering activities, and enlarged the role of the new "software engineer" role （which started from the sum of original dev+test) by adding more DevOps responsibilities.
[4] We didn't differentiate these terms: scenario test, functional test, e2e test, integration test. Our dev did help write quite some functional/scenario tests when test team was running tight. But by and large, the test team owned everything after unit test.
[5] We usually draw a timeline on the whiteboard, from left to the right: the developer changes code in his local repo -> unit test -> other pre-checkin tests -> checkin -> integration tests -> start production rollout -> rollout completed. So "push tests to the left" means push them into the unit test.
[6] SDE = Software Development Engineer. SDET = Software Development Engineer in Test (aka "tester").
[7] STE = Software Test Engineer. Microsoft had this job title until 2005/2006. STE's main job responsibility was writing test spec, enumerating test cases, execute test cases (mainly manually), exploratory tests, etc.. Many STEs had very good analytical skills, knowledgeable of our product and good soft skills, but relatively weak in coding, debugging, design, etc..

Zhengziying's Blog