3 lessons from GCP’s giant f*ck up
Null values, convention over configuration and defensive programming
I am sure by now many of you are aware that GCP accidentally deleted the account of one of its largest customers, an extremely rare and catastrophic one-off event. This was possibly the worst incident GCP has had in its entire history.
The gist of the issue was that a value left blank inadvertently defaulted to an automated account deletion after one year.
You can read the details here.
The craziest part of this story is that it was such a rookie mistake for a company that knows how to do software engineering extremely well. After all, Google has some of the best engineers on the planet.
So let's learn a few things from it, so we never make the same mistakes ourselves.
Lesson 1: Beware of null values!
The dangers of null values are well known. Golang (the language that Google invented!) is famously strict about this kind of sloppiness: it won't even compile if a declared variable is never used.
Consider this line in a bash script running as sudo:
rm -rf /$CACHEFOLDER
What do you think would happen if, for any reason, we forgot to assign a value to $CACHEFOLDER?
Everything under your root folder would be deleted: every file, every directory and its subdirectories, and any read/write volumes mounted beneath it - all gone.
This is why in bash it is good practice to begin your scripts with this setting:
set -o nounset
That makes your script exit with an error as soon as it expands an unset variable, preventing the catastrophe.
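Here is a minimal, safe-to-run sketch of the fix; CACHEFOLDER and its path are hypothetical stand-ins for the rm -rf example above, with echo used in place of the actual delete:

```shell
#!/usr/bin/env bash
set -o nounset   # abort the script if any unset variable is expanded

# Hypothetical cache path, standing in for the rm -rf example above.
CACHEFOLDER="var/cache/myapp"

# With nounset, forgetting the assignment above makes the line below
# exit with an "unbound variable" error instead of silently expanding
# to "rm -rf /". (echo stands in for rm -rf so this sketch is harmless.)
echo "rm -rf /${CACHEFOLDER}"
```

As a belt-and-braces alternative, `"${CACHEFOLDER:?must be set}"` fails with a custom message at the exact expansion, even in scripts that don't enable nounset globally.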
Lesson 2: Do better convention over configuration
I am a big fan of convention over configuration because it makes the user experience better and potentially safer.
The user experience is better because your users don't need to be burdened with providing a hundred settings to get things done. And it can be safer because you, as the designer of the tool, are likely more knowledgeable about what is potentially unsafe.
In GCP's case, however, the defaults were not safe! A safe default would have been for the account never to expire and auto-delete, or to expire and freeze rather than be deleted outright.
My guess here is that someone in Google set this up to create internal test accounts and ensure they were cleaned up after a while. They likely never thought it would be used to create client accounts in this way, or maybe it was just an oversight.
As the designer of a tool, it is your responsibility to make the defaults as safe as possible, so that something like this never happens.
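A minimal sketch of what a safe default could look like in a shell tool; RETENTION_POLICY and the policy names are hypothetical, not GCP's actual settings:

```shell
#!/usr/bin/env bash
# Convention over configuration with a *safe* default: if the operator
# configures nothing, fall back to the least destructive behaviour
# (keep forever), never to automated deletion.

retention_policy() {
  echo "${RETENTION_POLICY:-never-delete}"
}

apply_policy() {
  case "$(retention_policy)" in
    never-delete)    echo "account kept indefinitely" ;;
    freeze-after-1y) echo "account frozen after one year, data retained" ;;
    delete-after-1y) echo "deletion requires explicit opt-in" ;;
    *)               echo "unknown policy, refusing to proceed" >&2; return 1 ;;
  esac
}
```

Note the design choice: the dangerous option still exists, but it can only ever be reached by someone typing it out deliberately, never by leaving a field blank.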
Lesson 3: Do defensive programming right
This is the part of the incident that baffles me the most. Defensive programming is a widely adopted practice, and I am certain Google uses it extensively, but in this case the engineers seem to have forgotten to look both ways.
Even if the default for account deletion was set to one year for whatever reason, there were many other ways this incident could have been prevented:
Ensure the setting does not apply to paying customers
Warn the user of the tool, in big red letters, about what is going to happen before applying the setting
Send warnings to internal engineers before deletion
Prevent any account being deleted as long as someone keeps paying the bills
Freeze the account for x amount of time before actually deleting it
There are probably a dozen more ways you could have implemented a defence against this happening, but somehow this wasn’t done.
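Several of the defences above could be layered into the deletion path itself. A hedged sketch, where is_paying_customer, notify_oncall and freeze_account are hypothetical helpers rather than real GCP tooling:

```shell
#!/usr/bin/env bash
# Defensive deletion: stack independent checks so that a single bad
# default cannot destroy a live account on its own.

delete_account() {
  local account_id="$1"

  # Defence 1: never delete an account that is still paying its bills.
  if is_paying_customer "$account_id"; then
    echo "refusing to delete paying customer $account_id" >&2
    return 1
  fi

  # Defence 2: warn humans before anything irreversible happens.
  notify_oncall "account $account_id scheduled for deletion in 30 days"

  # Defence 3: freeze for a grace period instead of deleting outright,
  # so a mistake can still be reversed.
  freeze_account "$account_id" --days 30
}
```

Each check is redundant with the others by design: defensive programming assumes any single safeguard, including a sane default, can fail.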
The lesson here is that when you design applications you need to think continuously about worst-case scenarios, especially ones as critical as this.
Conclusion
I am not going to be too harsh on Google here; incidents like this can happen, and everyone makes mistakes. The important thing is that they appear to have learned from it and done a post-mortem, so a repeat is very unlikely. It makes no sense for you to pack your bags and move to another provider over this alone - and I say this as a principal engineer at a competing cloud!
Still, this has also taught us that it may be wise to have a multicloud strategy for backups and additional resilience, especially if your business is as critical as the one that was affected.
Shame the lesson was so costly; I hope Google rewarded them with at least a few years' worth of cloud credits!