7 Things We Did Right in a Successful Data Migration Project

Someone on Quora was asking how to manage data migration when there is a database schema change. I shared how we did it in a real data migration project back in 2006/2007. It was a payment system (similar to today's stripe.com, but ours wasn't for the public) that ran on .NET XML Web Services + SQL Server. Much simplified for ease of writing:

  • It had a Subscriptions database containing a payment_instruments table, where we stored encrypted credit card numbers.

  • Having subscription_id on the payment_instruments table implied the assumption that every payment instrument must belong to one and only one subscription.

[Figure: 2015-07-data-migration-before]

Now we wanted to support standalone payment instruments, which don't belong to any subscription. So we needed to migrate the payment instrument data into a new payment_methods table in a new Payments database:

[Figure: 2015-07-data-migration-after]

It was a very successful data migration project. We did quite a few things right, which I will repeat in any future data migration project:

  1. We kept the old payment_instruments table. We added a new payment_method_id field to it, so that the payment_instruments table acts as a proxy (see the first sketch after this list). The benefit: we could keep most of the legacy code untouched, letting it continue to consume the payment_instruments table. We only needed to change the data access layer a bit, to backfill the encrypted credit card number from the new payment_methods table whenever the legacy code queried the payment_instruments table.
  2. We added a payment_method_migration_state field to the old payment_instruments table, to indicate whether the old or the new table is the source of truth. We used an explicit field as the indicator rather than an inferred value (for example, looking at whether the encrypted_credit_card_number field is null in the old payment_instruments table), because an explicit, dedicated indicator of migration status is much less confusing than an inferred one, which is more error prone because it gives something already in use a new meaning on top of its original one. The explicit indicator also serves a little bit as a lock: while a migration is in progress, certain update operations should be blocked.
  3. We used both online and offline migration. Online migration: any time a mutation API was called on a payment instrument, such as UpdatePaymentInstrument or PurchaseOffering (with a certain payment instrument), the migration code was triggered and ran in the Web frontend: it inserted a row into the payment_methods table, copied over the encrypted_credit_card_number value, backfilled the payment_method_id in the old table and set the payment_method_migration_state (see the second sketch after this list). Offline migration: a standalone tool running in our datacenter went through the existing payment instruments and migrated them one by one. We needed offline migration on top of online migration because some customers used our system very infrequently, such as once every three months, and we didn't want to wait three months to migrate their data.
  4. We controlled the migration at the per-customer level. We designed it so that we could select a batch of customers to be eligible for migration (both online and offline). That way, we could start with a very small number (say 100 customers), then expand to 1,000, 10,000, 10% of the system, and finally all. We did find some critical bugs during the first several small batches.
  5. Due to compliance requirements, we must not keep the encrypted_credit_card_number data in the old table. But we didn't do the deletion until the entire migration was done. That way, if anything went seriously wrong, we still had a chance (even if just in theory) to go back to the old schema. In fact, we did have a bug that messed up data (putting encrypted_credit_card_number on the wrong payment_method_id), and having kept the old data allowed us to redo the migration correctly. It saved the day.
  6. We made the two new fields on the old payment_instruments table nullable, rather than giving them a default value, to prevent the data pages from being rearranged for the existing rows (nearly hundreds of millions of them). For the same reason, when we removed the encrypted_credit_card_number data from the old table, we didn't delete it but set it to an all-spaces string of equal width to the original encrypted blob (also shown in the first sketch below).
  7. During testing, we modified the deployment script so it could deploy the old and new versions of the frontend side by side, because the AddPaymentInstrument API in the new version would always put data in the new schema, and our test automation needed the ability to create data in the old schema in order to test the migration code. This ability is useful beyond data migration projects; in online services generally, it's always good to know whether data created by older version(s) can be correctly handled by the new version.
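
To make #1, #2 and #6 concrete, here is a minimal sketch of the schema change and the later compliance scrub, written as Python driving T-SQL through pyodbc. The table and column names come from the simplified model above; the connection string, the TINYINT state codes and the assumption that the card column is a fixed-width string are mine, not the real system's.

```python
# Sketch of the schema change (#1, #2) and the compliance scrub (#6).
# Hypothetical: connection string and state codes are made up.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Subscriptions;Trusted_Connection=yes")
cur = conn.cursor()

# #1 and #2: add the proxy pointer and the explicit migration state.
# Both columns are nullable, so SQL Server only touches metadata and
# leaves the existing data pages where they are.
cur.execute("""
    ALTER TABLE payment_instruments ADD
        payment_method_id UNIQUEIDENTIFIER NULL,      -- proxy pointer into the new payment_methods table
        payment_method_migration_state TINYINT NULL;  -- NULL = not started, 1 = in progress, 2 = done
""")

# #6: after the entire migration is done, scrub the card data in place.
# Overwriting with spaces of equal width keeps every row the same size,
# so again no data page has to be rearranged.
cur.execute("""
    UPDATE payment_instruments
    SET encrypted_credit_card_number =
        REPLICATE(' ', DATALENGTH(encrypted_credit_card_number))
    WHERE payment_method_migration_state = 2;
""")
conn.commit()
```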
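
And a sketch of the online half of #3, with the per-customer rollout gate of #4 folded in. The function and parameter names are made up, and for brevity a single cursor stands in for what were really two databases (Subscriptions and Payments):

```python
# Hypothetical sketch: called at the top of every mutation API
# (UpdatePaymentInstrument, PurchaseOffering, ...).
import uuid

def migrate_if_needed(cur, pi_id, customer_id, eligible_customers):
    # #4: rollout control -- only customers in the current batch migrate.
    if customer_id not in eligible_customers:
        return

    row = cur.execute(
        "SELECT payment_method_migration_state, encrypted_credit_card_number"
        " FROM payment_instruments WHERE payment_instrument_id = ?",
        pi_id).fetchone()
    if row.payment_method_migration_state == 2:
        return  # already migrated; the new table is the source of truth

    # #3: copy the encrypted blob into the new payment_methods table ...
    new_id = str(uuid.uuid4())
    cur.execute(
        "INSERT INTO payment_methods"
        " (payment_method_id, encrypted_credit_card_number) VALUES (?, ?)",
        new_id, row.encrypted_credit_card_number)

    # ... then flip the proxy pointer (#1) and the explicit flag (#2).
    cur.execute(
        "UPDATE payment_instruments"
        " SET payment_method_id = ?, payment_method_migration_state = 2"
        " WHERE payment_instrument_id = ?",
        new_id, pi_id)
```

The offline tool is then essentially a loop over the eligible, not-yet-migrated rows, calling the same routine.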

The above seven things that we did right will be applicable to any future data migration project I do. #6 (preventing data pages from being rearranged) may be specific to SQL Server, but its spirit is widely applicable: understand the underlying implementation of the database system well, to minimize the performance hit when migrating a non-trivial amount of data or touching a lot of rows.

Besides those, two more takeaways of mine:

  1. Have the right expectations. Data migration is hard. After all the time spent on design, implementation and testing, the actual migration will also take a lot of time. In our project, we ran into weird data patterns in production that we never thought possible; they turned out to be the result of some old code that was long gone (either retired, or fixed as a bug). In production we also discovered quite a few bugs in our migration code that were hard to find in the test environment. It took many iterations: discover an issue, fix it, test the fix, roll out the new bits, resume the migration, discover the next issue. It would help if you could get a snapshot of full production data to test your migration code offline, but in some cases, due to security/privacy/compliance, the data to be migrated must not leave the production datacenter, and sanitizing it would defeat the purpose.
  2. Do not migrate the frontend and the database at the same time. If you must abandon both the old frontend (e.g. REST API, Web UI, etc.) and the old database, do it in two steps. First, do the data migration: keep the frontend unchanged to customers, and only change the frontend code under the hood to work with the new database. Second, build the new frontend on top of the new database. The two-step way sounds more costly, but in my experience (I have done it both ways, in different projects), it counter-intuitively ends up more cost efficient, less risky, more predictable and more under control.

Better off Financially Working at Startups or Large Companies?

A recent TechCrunch article said:

"Mathematically, talented individuals are certainly better off financially going into a profession or working at a large tech company, where pay is higher and more secure."

I used to believe the same. However, in the last couple of years, after seeing real examples among people I know in person, I have become not so sure. Those examples include (with made-up names and genders):


  • Helen used to work at Microsoft. She joined a not-too-hot startup a couple of years ago. Her equity doubled in worth, from under $500K to nearly $1M, in less than a year.

  • Frank recently took an offer from a local startup in Seattle that pays a base salary more than 10% higher than what he earned at Microsoft, not to mention the stock options.

  • Bob recently told me that he has an offer from a near-IPO company in the Puget Sound area with a $200K base salary, which equals the sum of the base salary and annual bonus he could get at Microsoft.

Financially, they all seem to be better off working for a startup than staying at Microsoft. So, is the TechCrunch article wrong (at least on the "higher" part)? To me, TechCrunch has pretty good credibility on tech startup matters. The author and editors must have a lot more data, visibility and network resources than I do. So they must have a fuller picture, and maybe my data set is just too small.

How can I find the truth[1]? What about the Glassdoor model? I am not sure. Glassdoor comes reasonably close when it comes to finding the median Microsoft salary in the Seattle area. But unlike base salary, which is well defined and comparable across the board, the financial return of working at a startup is far more complex.




[1] The reason I am seeking the truth about whether, mathematically, one is financially better off at a large tech company than at a startup is pure curiosity ("There are those who seek knowledge for the sake of knowledge; that is Curiosity." -- Saint Bernard of Clairvaux). It's not going to make me either more or less inclined toward a startup job. In fact, I had an offer from a late-stage startup not long ago. I didn't take it, though.
[2] I found two related posts interesting: "Startup employees don’t earn more" and "Big company vs. startup work and pay".

Acknowledge Our Lack of Empathy

I had a woman employee a few years ago. She wasn’t always available in the office. She told me it was because of her children: the boy was sick, the girl had to stay at home, she needed to pick them up because their dad couldn’t that day, she needed to prepare meals, etc. At that time, I didn’t have kids. I was married, though. I told her “I can understand”. But later, when I had my own child, I realized that earlier I hadn’t understood her situation at all. People who don’t have children just don’t get the kids thing, no matter how sympathetic they are.

I had a woman manager reporting to me. She wasn’t married and had no children. One of her employees got pregnant, and I was having a chat with the manager about how to support a pregnant employee. She said “I can understand”. I told her “No, you don’t”. I knew how hard it is, because my wife had gone through pregnancy just a year earlier (side note: it later turned out that pregnancy is a piece of cake compared to the first six months after the birth). The manager was a very nice person. It’s just that there is no way to understand what being pregnant is like unless you have been there.

I have had some lower back problems in the last couple of weeks. It’s painful. It takes a lot of effort and time to put on socks and shoes or to get into and out of my car, and I even hesitate to walk from my office to someone else’s just down the hallway. People in the office see that I am in pain and wish me a quick recovery. When I dropped my marker pen, they picked it up for me. I really appreciate their kindness and understanding. Now I think I really understand what it is like to have a lower back problem, and I will be truly empathetic in the future when a team member has one too.

Conclusion:

We must have the self-awareness that we don’t really understand a difficulty unless we have been there ourselves. In that case, maybe it’s better to acknowledge our lack of empathy. Rather than saying “I can understand”, we could say “I have never had a lower back problem myself, so I can’t feel your pain. But I am willing to help. Let me know what I should and should not do”.

Bowling Is the Worst Team Morale Idea

Over the years, I have taken my teams to, participated in, or heard of various kinds of team morale events, including: bowling, whirlyball, movies, scavenger hunts, boat chartering, go-karting, curling, indoor skiing, day of caring, iron chef, family fun centers, pool/billiards, laser tag, karaoke, ...

But not all of them make for a good team morale event. Here is how I define "good":

1. Easier to blend people

A team morale event is a great place for team members to get to know others they haven't had the chance to work closely with before. However, people naturally tend to hang out with those they are already familiar with, since it's their comfort zone, and stay there. This is more of a concern for software companies, because many engineers are introverted and socially passive. So a good morale event must make it easier and more natural for engineers to switch groups.

In bowling, each lane goes at a different pace. There is hardly ever a good moment when two lanes have both finished a game at the same time so people can swap. Plus, someone may hesitate to join the big boss's lane, not wanting to be seen as an ass-kisser. A scavenger hunt isn't good either: we get split into teams and stay with our own team throughout the whole hunt.

In whirlyball and curling, switching groups is less awkward and less likely to be over-interpreted. People can switch sides before and after a game.

In the day of caring event, introverted people could switch groups easily and naturally, too. People in the upstairs room shouted "we need someone to help move the furniture", and a big guy put down his yard work and came to help.

2. Not something that I can do myself

When I choose a morale event idea, I prefer things that I can't do myself, for reasons like affordability, a minimum-number-of-people requirement, etc.

For example, I can't do these myself, which makes them better choices for team morale events:


  • Curling: it takes 8 people (4v4)

  • Whirlyball: it's usually 5v5 or 6v6

  • Scavenger Hunt: it's best for 20-30 people

  • Go-kart: the more people, the more fun. With 10 or 12 karts on the track, there is a lot of bumping, passing and laughing.

  • Boat chartering: a family's budget can only get a much smaller boat.


Bowling, movies and family fun centers are things I can do with my family on weekends, so they are less preferable as team morale events.

3. Doesn't need a lot of practice

Pool and bowling aren't good for a morale event, because it takes quite some practice to build the skills needed to truly enjoy the game. It's no fun if my bowling ball always ends up in the gutter. Plus, between a newbie and someone who has played pool or bowling a lot, the skill gap is hard to close in an hour or two. Pool and bowling are only enjoyable when the players' skills are nearly level.

Curling is better. Although it also takes a lot of practice to get good at curling, few of us in software companies have played it much before; most of us are at beginner level. Same with go-karting: few of us are pro racers, and everybody can push the gas, hit the brake and steer.

Day of caring is even better: I learn useful skills in the volunteer work. Last time, my team was helping an assisted living facility. We were cleaning up the yard and also the interior. I painted walls, which I had never done before. I learned some tips from others, and now I feel more confident about painting my own place (maybe starting with the garage).

4. Sense of achievement and shared memory

There was a very strong sense of achievement in the day of caring event. Before we came, the place was in a poor state: the paint had peeled, the weeds were tall, the walkway was muddy. We painted the walls, cut the weeds and bushes, and paved the walkway with gravel. When we left, the place looked much nicer. Being able to make a dramatic change in a single day feels good. Plus, it was something that "we did together", and such memories create a good bond between team members.

5. Doesn't make people feel intellectually inferior

Most morale event types that involve competition, including whirlyball, curling, go-karting, bowling, etc., are physical competitions.

But a scavenger hunt is more of an intellectual competition: you need to solve puzzles, think of ways around roadblocks, etc. Losing teams feel they are less smart and less good at problem solving. That hurts our ego, because we software engineers are intellectual workers: we are proud of our intellectual horsepower and problem-solving skills, which are what it took to pass the interview and get our jobs.

So I would avoid any morale event type that is an intellectual competition.

6. No individual ranking 

In events like go-karting, everybody gets a score. Your score is higher than mine; mine is higher than his. Haven't we had enough of that in the office already? Everyone gets a number each year; your number is better than mine, so you get a bigger bonus than I do, and you get the promo while I don't. We are all sick of ranking in the office. Better not to have another ranking when we are having fun.

Conclusion

Based on the above six criteria, here is how these morale event ideas score (where 'x' is 0, 'v' is 1 and '+' is 2). Not surprisingly, bowling is the worst.

[Figure: 2015-07-team-morale]

Make Breaking Changes Without API Versioning

Lots of REST APIs are versioned: the Microsoft Azure Service Management API, the AWS EC2 API, the Facebook Graph API, the Twitter REST API, Stripe, etc. The motivation behind API versioning is obvious: to make it easier to introduce breaking changes without disrupting existing client code. As Facebook's API doc says:

“The goal for having versioning is for developers building apps to be able to understand in advance when an API or SDK might change. They help with web development, but are critical with mobile development because a person using your app on their phone may take a long time to upgrade (or may never upgrade).”

However, API versioning can incur substantial engineering cost. Take the Microsoft Azure Service Management API as an example. It currently has more than 20 versions, of which the first, 2009-10-01, was released six years ago and is still in use. Customers' expectation is that as long as they stick with the same version (e.g. 2009-10-01), their client code will never need to change. Assume there are 1,000 test cases for the Service Management API; to deliver on that expectation, every time Azure upgrades the Service Management API, it theoretically has to run those 1,000 test cases 20+ times, once for each version from 2009-10-01 all the way to 2015-04-01 -- over 20,000 test runs per upgrade! A retirement policy like Facebook's helps a bit, but even in Facebook's case they still, in theory, have to run all the test cases 5 times (once each from v2.0 to v2.4).

The engineering cost problem doesn't change much whether the multiple versions are served by the same piece of code/binary or by separate ones:

  • You may implement the API with only the outer layer versioned and most of the core/backend code version-less, hoping to save some testing. But that actually increases the chance of accidentally changing the behavior of older API versions, since they all share the same core/backend code.
  • Making the core/backend code versioned as well gives better isolation between versions, but it leaves many more code paths to cover in testing.
  • Forking a maintenance branch for each version may sound appealing, since it mostly eliminates the need to run full-blown testing for v2.0, v2.1, v2.2 and v2.3 when you release v2.4 (they live in their own branches). But applying a bug fix across multiple maintenance branches may not be as trivial as it sounds. Plus, with multiple versions of binaries running side by side, data compatibility and corruption problems become real.

I was a proponent of API versioning until I saw the cost. Then I went back to where we started: what are the other ways to make it easier to introduce breaking changes without requiring all client code to upgrade?

Policy-driven may not be a bad idea. Let's try it on a couple of real examples of breaking changes (a sketch follows the list):

  1. The Azure Service Management API introduced a breaking change in version 2013-03-01: "The AdminUsername element is now required". Instead of introducing a new API version, we could add a policy "AdminUsername element must not be empty", starting with "Enforced = false" for every subscription. Once enforced, any API call with an empty AdminUsername fails. An individual subscription can turn the enforcement on by itself in the management portal, or it is forcibly turned on after two years (the equivalent of Facebook's two-year version retirement policy). Once turned on, it can't be turned off. During those two years, some other new features might require the "AdminUsername element must not be empty" policy to be on. It's up to each subscription whether to delay the work of changing the code around AdminUsername at the cost of delaying the adoption of those other new features, or pay the cost of changing the code now to get access to the other new features sooner.
  2. The Facebook Graph API deprecated "GET /v2.4/{id}/links" in v2.4. As an alternative to introducing a new API version, we could add a policy "Deprecate /{id}/links". It would be a per-App-ID policy, viewable and settable in the dashboard. App owners would receive reminders as the deadline for turning the policy on approaches: 12 months, 6 months, 3 months, 4 weeks, 1 week, ...
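
Here is a minimal sketch of how such a policy gate could look on the service side. Neither Azure nor Facebook actually works this way; the policy store, the deadline date and all the names are hypothetical:

```python
# Hypothetical policy gate for the AdminUsername example above.
from datetime import date

# Deadline after which the policy is forced on (two years after the
# 2013-03-01 breaking change, mirroring Facebook's retirement window).
FORCED_ON_DATE = {"AdminUsernameMustNotBeEmpty": date(2015, 3, 1)}

# Per-subscription opt-in state; in a real system this would live in
# the management portal's database.
subscription_policies = {}  # subscription_id -> {policy_name: bool}

def is_enforced(subscription_id, policy):
    if date.today() >= FORCED_ON_DATE[policy]:
        return True  # past the deadline: forced on, can't be turned off
    return subscription_policies.get(subscription_id, {}).get(policy, False)

def create_vm_deployment(subscription_id, admin_username):
    # The 2013-03-01 breaking change, expressed as a policy instead of
    # a new API version.
    if (is_enforced(subscription_id, "AdminUsernameMustNotBeEmpty")
            and not admin_username):
        raise ValueError("AdminUsername element must not be empty")
    # ... otherwise proceed with the old, lenient behavior
```

Turning the policy on early for a test subscription is the policy-driven counterpart of switching a client to a new API version in a development environment, which leads to the discovery question below.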

The policy-driven approach has different characteristics from API versions when it comes to discovering breaking changes. With API versions, when the client switches to v2.4 in its development environment, errors pop up right away in the integration tests if any hidden client code still consumes APIs deprecated in v2.4. With the policy-driven approach, the developer needs to use a different App ID, one in development mode, turn the policy on, and run the integration tests. I don't see a fundamental difference between the two ways; each has pros and cons. Advancing a single number (the API version) may be less work than flipping a number of policies, but policies give the client more granular control and flexibility.

At the end of the day, API versions may still be needed, but only for a major refactoring/redesign of the API, which should happen only once every couple of years. Maybe it's more appropriate to call those "generations" rather than "versions". Within the same generation, we can use the policy-driven approach for smaller breaking changes.