On Troubleshooting

Imagine a line. At one end – say, far off to your right – the line starts to blur and fade.  Out there is where all the awesome, elite programmers live.  Now take a good, long sprint back down the line, past the midpoint and off to your left.  Not all the way, just a bit. If I’m being honest with myself, I’m at my desk near that point on the line.  Away from the elite and the great and the specialized, near most of the rest of the people who are hacking out a living in this field.  What I’m saying is I’m not a great programmer.  I’m good, but I’m not great.

Except in one area: I’m as good a troubleshooter as anyone I know.  Oh, yeah, ok, there are ones better than me out there.  Lots, probably.  But when it comes to narrowing in on a problem, finding a hidden glitch, I’m pretty great when I put my mind to it. Before the rancid smell of ego overwhelms you, though, let me say one more thing: it’s not because I’m anything special.  It’s not knowledge or ability or intelligence. It’s just that most everyone else handcuffs themselves before they begin.

When something breaks, before you really start trying to understand why, your brain flashes up as good a model of what’s going on as it can manage. It makes a lot of assumptions to do this, things that you’re pretty sure are true, so that you have something to work with.  If an application is locking up every time you try to print, your brain dredges up any facts it can about how the application prints.  It never has enough facts, though, so assumptions form a sort of cartilage between the facts to hold the skeleton of the model together.  Then things go wrong.

Your model is screwed up, I promise you. You’ve missed something, or misunderstood something, or just plain gotten it all backwards.  That’s fine. You needed a starting place, and if you start staking out the really awful ones, you get a sense of the shape of the real problem.  Only people start to get confused over what was a fact and what was an assumption, and if one of those assumptions is wrong, and that assumption is connected to what’s kicking your application in the shin, you’ve effectively lost your way out of the maze.

At some point when you’re solving a really gnarly issue, you’ll hit a wall.  You’ve tried everything sane, you’ve exhausted the rational options and you’ve still gone nowhere.  That’s when you need to take a knife to your assumptions – even the ones that just have to be true, the ones that really should be facts – and start hacking.  Mere Smith blogged about screenwriting yesterday, and repeated one of the most useful lessons in fiction: Kill your darlings. Today, your assumptions are your darlings, and they need to bleed.

This sounds obvious, right? You’re thinking I’m not really saying anything useful.  I get it. This is really obvious advice.  It’s just that almost no one takes it. Even, and especially, really smart people who know their crap.

I had a problem today when we were trying to deploy the most recent version of the website.  It was dying when trying to generate the spriting for our images (something about which I understand almost nothing) and rolling back.  But why?  What was dying?

./smartsprites.sh: 8: java: not found

I read this yesterday, after an awful day involving low-speed car wrecks and a hundred obnoxious technical problems, and I just thought, “Crap, something is wrong with this sprite garbage that I don’t understand.”  And, since it was the end of the day, I told myself I’d look at it with an awake brain tomorrow and went home.  I got in this morning, looked at the actual shell script that was failing and realized I’d been a moron.  There’s nothing wrong with the sprites.  It’s telling me what’s wrong right there.  It can’t find java.  It’s trying to call a command, and that command just plain doesn’t exist.

And here’s where the confusion sets in.  You think:

  1. I’ve deployed this site 40 times before this.  We’ve never had a problem with java.
  2. Nothing has been deleted or changed on the server.
  3. Maybe the java command is failing and I’m just misunderstanding the error. Because…
  4. I know for a fact that Java is installed.

Only you don’t.  In your bleary-eyed death march the day before, you switched from using Web Server 1 as your deploy target to Web Server 2, which used to just pick up the changes you deployed to 1.  Java is installed: on Web Server 1.  You don’t know crap about what’s installed on 2, because it never mattered until today.

When I brought my sysadmin over to say, “We don’t have java installed on Web Server 2,” he pushed back.

“Of course Java’s installed. It’s installed on every machine.”

This went back and forth for about a minute.  This is what I mean when I say smart people let their assumptions become shackles.  We’ve got an error. The error says JAVA NOT FOUND and nothing else. You really, really feel like java was installed.  So something is incorrect. Either the error is wrong, or your assumption is off.  Most people, at this point, go with their assumption and test something else.  But it doesn’t cost you anything but a little time to prove it.  Make a run at the problem like you’re totally wrong about this one thing and see what happens.  Either you’re proven right (“Ha! Told you Java was installed!”) or you’re proven wrong and you just won a little bit of the future.

In this case, I had to log into the server and type “java” at the command prompt to prove it wasn’t installed.  If I hadn’t had such an easy way to test that, we’d have been arguing for a lot longer.  Assumptions do their best to keep you from doing things you feel are a waste of time.  It can’t be that, so testing it out would be a massive waste of time.  But when you’re lost in the swamp, you have to stop worrying about wasting time and you have to start hacking down weeds.  Any weeds you can see, even the ones you really suspect will just lead to a dead end.  You left efficiency behind when you got stuck with this problem.  Now you need to be mercenary and be willing to turn your knife on anything that might be in your way.
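If you want that “type java at the prompt” test as something you can rerun, here’s a rough sketch.  It isn’t what I actually ran; the check_command helper is a name I made up for illustration, and it only asks the shell whether a command is on the PATH, nothing smarter than that.

```shell
#!/bin/sh
# check_command is a hypothetical helper, not part of any deploy tooling.
# It asks the shell whether a command can be found on this machine's PATH.
check_command() {
    if command -v "$1" >/dev/null 2>&1; then
        # Found: print where the shell resolved it.
        echo "$1: found at $(command -v "$1")"
    else
        # Not found: say so and return a failing status.
        echo "$1: not found"
        return 1
    fi
}

# Test the assumption instead of arguing about it.
check_command java || echo "So much for 'Java is installed everywhere.'"
```

Run it on the box that’s actually failing, not the one you think is equivalent.  One more assumption worth stabbing while you’re in there: the PATH your deploy script sees may not be the PATH you get when you log in by hand, so a “found” at your prompt is strong evidence, not proof.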

Don’t be afraid to be scattershot. Don’t be afraid to try out things that can’t possibly be the issue.  Eliminate every option.  Remove all possibilities.  Improbable isn’t good enough. Prove it impossible, or mark it as a potential problem.  Don’t feel bad about stabbing your assumptions.  They’re usually asking for it.
