The fear of failure in Engineering

Henry Gallert
4 min read · May 12, 2021

In most organizations I have worked at, I have seen software engineers fear making mistakes. None of these organizations worked on anything that people’s lives depended on.

It’s an accepted mental model that we have to avoid mistakes at all costs.

It is a paradox to judge people by the number of mistakes they made and still expect them to move quickly and make bold moves.

I think MISTAKES are the wrong currency, and LEARNINGS are the better one.

Photo by Tim Mossholder on Unsplash

A failure or mistake is not a bad thing in itself. What matters is how we deal with it.

Example

We deleted some records that were still in use during a clean-up activity of some old database entries.

Context

Our product has evolved over time and so has its database. We had a bunch of products in our database that no one used anymore. Naturally, we wanted to clean them up, and as it seemed safe to delete them, we went ahead.

We created a script and followed our standard procedures of code review and CI/CD. Nothing signaled an error. So we deployed the change, ran the script and deleted the obsolete data from production. Our monitoring looked as expected and we moved on.

After a short while, we noticed a change in purchases: the distribution of products people subscribed to was different.

It turned out to be related to the deleted database records, which made our mobile client pick a fallback product. The client didn’t crash; instead, it behaved in a way we did not want. The customers probably didn’t even notice the difference.

The engineers who made the change quickly came together to restore the desired state.

That’s where we as leaders can make a difference.

They contacted me to support them in accessing a database backup. The plan was to recreate the deleted records by copying them from the backup.
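The recovery plan can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: the `products` table, its `id` and `name` columns, the helper names, and the use of SQLite are all assumptions made for the example.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory database with a hypothetical products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO products (id, name) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def restore_missing_products(backup, prod):
    """Copy rows that exist in the backup but not in production.

    Returns the ids of the restored rows so the result can be reviewed.
    """
    existing = {row[0] for row in prod.execute("SELECT id FROM products")}
    restored = []
    for product_id, name in backup.execute(
        "SELECT id, name FROM products ORDER BY id"
    ):
        if product_id not in existing:
            prod.execute(
                "INSERT INTO products (id, name) VALUES (?, ?)",
                (product_id, name),
            )
            restored.append(product_id)
    prod.commit()
    return restored
```

Because only missing ids are inserted, running the function a second time restores nothing, which makes the recovery step safe to repeat.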

Instead of asking how it could happen that they deleted data from production that was still needed, I asked what they needed to restore the desired state. They already owned the issue and were fully committed to fixing it. They probably felt enough discomfort at being exposed by the mistake they had made.

If MISTAKES were the currency, this would count as one. Next time, they would try to make zero mistakes.

If the currency is LEARNINGS, and we look at this incident, it is so much more valuable.

  1. Don’t work on a task from your backlog that you cannot really own. This task was a year old. The originator had already left the company, and with him the full understanding of the task’s implications. I should have required a full re-refinement of the task so that we could really own it.
    → So either delete old tasks (if they are important, they will reappear), or bring them to a stage where you can really own them.
  2. It took too many people to restore data from a backup. It required some people from the DevOps team to load a specific backup into a new database instance. From there on, the developers were able to solve the problem on their own.
    → The consequence is to enable everyone inside engineering to restore an old backup and access it to extract some data. We have built a Jenkins job to make that happen at any time it’s required.
  3. Some cases that are crucial to our business were not tested as part of our CI/CD process.
    → We add them to our regression tests to ensure the expected behavior in the future.
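As an illustration of the third learning, here is a minimal sketch of the kind of regression check that would have caught the problem. It is not our actual test suite: the catalog dictionary, the `"fallback"` key, and the function names are hypothetical stand-ins for the real client and database.

```python
FALLBACK_PRODUCT = "fallback"  # hypothetical key for the fallback product


def resolve_product(requested_id, catalog):
    """Mimic the client: silently pick the fallback if a product is missing."""
    return catalog.get(requested_id, catalog[FALLBACK_PRODUCT])


def missing_products(required_ids, catalog):
    """Return the required product ids that are absent from the catalog."""
    return sorted(pid for pid in required_ids if pid not in catalog)


def check_catalog(required_ids, catalog):
    """Fail loudly instead of letting the client fall back silently."""
    missing = missing_products(required_ids, catalog)
    if missing:
        raise AssertionError(f"products still in use are missing: {missing}")
```

The point of the design is that `resolve_product` never fails, so a crash-based monitor stays green. A check like `check_catalog`, run in CI against the list of products the client still offers, turns the silent fallback into a visible failure.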

Was the whole thing a success? NO.

Did it yield a couple of improvements and learnings? YES, absolutely.

If I judged the engineers who did this by the number of mistakes they made, it would not look good for them. As a consequence, they would act even more carefully in everything they do in the future. This would make them slow and unhappy.

If I judge them by the level of ownership they have shown in

  • how they handled the incident
  • their openness to ask for help
  • admitting a mistake
  • their active involvement in generating learnings

it looks much better for them.

Yes, they made a mistake. Yes, they learned something from it. Next time, we can recover faster from similar issues. This creates a much more engaging environment that encourages people to take action and recover fast in case of a failure.

I’m pretty confident all of us have made similar mistakes in our lives or careers. That’s how we learned and gained our experience. We were like kids exploring the world and collecting some scratches, driven by curiosity and not by the fear of failure.

We as leaders need to create an environment that encourages people to learn and grow. I believe one element of such an environment is the way we react to failure and mistakes.

Reassess how you or your company handled recent incidents. Have you really gained the most out of them? Did you run a blameless postmortem? Did you improve your systems and processes or share knowledge about the learnings?

If not, prepare yourself to react differently next time.

The key is not to avoid mistakes. It's to recover fast and learn from them.

Henry Gallert

I’m an engineer at heart and a pragmatic idealist. Director of Engineering @ Freeletics