Yesterday, I came across a fabulous talk by Bryan Cantrill at GOTO 2017 titled “Debugging Under Fire: Keep your Head when systems have Lost their Mind”. It’s a very interesting and entertaining talk, so I highly recommend watching it in full.
I had never heard this simple definition of debugging before:
Debugging is the process by which we understand the system.
36:19
The part that resonated with me the most, and that I might expand on in a separate post, is the chapter on “The Art of Debugging”. He explains that debugging is not magic, even though we sometimes make it look that way. Instead, it’s about asking questions and then actually verifying the answers to those questions, not guessing them!
Debugging is the act of asking questions and answering them, not guessing what the answer is. You are playing Twenty Questions. […] You want to form questions, not hypotheses. […] We should be asking questions, and as those questions give answers, those answers are facts, and those facts constrain hypotheses, and we repeat this process of answers to questions, making more specific questions, more specific answers, more specific questions, more specific answers, and then that hypothetical leap is often not a leap at all; it’s a step across a puddle.
Chapter Start: 38:48
I was shocked not to hear more laughter from the crowd at the great Windows joke. 😄
We do not want to overemphasize recovery with respect to understanding how the system works. We do not want to so believe in recovery that we actually no longer understand the system and that we believe that broken software, can simply be made up for things, by restarting everything all the time. That’s called Windows and humanity did that experiment and it didn’t work.
46:44
Luckily, I haven’t been part of a massive outage. Bryan makes a good point about postmortems and why you should write them. One point that may be missing here is that a postmortem can also teach other engineers about the issue. Just because you know the root cause doesn’t mean the rest of the team does as well.
It’s so much more important that you completely understand what happened. […] You write it [the postmortem] up, so you completely understand it, that’s why you write it up. The write up is to force complete understanding.
49:08
Just a short post, so you have more time to go and watch the talk in full.