Crash-only software

I read "Is your software crash-only?" today, and the paper it links to. As a file system developer by day, I find the philosophy a familiar one. A lot of the code I write deliberately forces crashes. Crashes aren't inherently evil. They can protect data, they can increase availability, and they can make it easier to fix root causes.

I remember, back in the days before Java, that one of the things I most hated was when a C program would get SIGSEGV and just disappear, taking my work with it. Java, as I experienced it, was wholly different: an exception would propagate up the call stack, but the program wouldn't terminate. It would just print a stack trace and carry on. Often, you'd just be able to ignore what had happened. Other times, you could save what you were doing and restart. In the worst cases, you could often save much of what you'd been doing and limit your losses to some specific failed part.

Not crashing, I felt, was a major step forward.

Then I left application development in Java and went back to file system development in C++, where the sooner I can crash the server when things start to go wrong, the sooner it will recover.

In my work, in C++, assertions are a fundamental part of what I do. Whenever I write a comment, I ask "can I rewrite this as an assertion?", and if I can, out goes the comment and in goes an assertion. In my spare time, in Java, I've never written assertions. I still don't. (I've written exactly one assertion in Java. I can't remember where it is, or what it asserts, but I remember it was in the next method I wrote after the last time I noticed this discrepancy.)
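To make the comment-to-assertion habit concrete, here is a minimal sketch in Java. The class `BlockMath`, the method `blockIndex` and the block-alignment rule are all invented for illustration; they aren't taken from any real file system. Note that Java's `assert` statement is only checked when the JVM is started with `-ea`, so by default the program carries on regardless.

```java
// Hypothetical example: names and the alignment rule are illustrative only.
final class BlockMath {
    private BlockMath() {}

    // Before, the requirements lived only in a comment:
    //   "offset must be a multiple of blockSize, and blockSize must be positive"
    // After, the same requirements are stated as assertions that can fail loudly.
    static long blockIndex(long offset, int blockSize) {
        assert blockSize > 0 : "blockSize must be positive: " + blockSize;
        assert offset % blockSize == 0 : "offset must be block-aligned: " + offset;
        return offset / blockSize;
    }

    public static void main(String[] args) {
        // Run with "java -ea BlockMath" so the assertions are actually checked.
        System.out.println(blockIndex(8192, 4096)); // prints 2
    }
}
```

With `-ea`, calling `blockIndex(100, 4096)` throws an `AssertionError` at the point where the assumption breaks, rather than returning a quietly wrong answer that causes trouble later.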

The main reason I don't write assertions in Java is that I have this idea that Java programs don't give up; they struggle on.

These two philosophies aren't as different as they might seem; they both aim to avoid losing the user's work. The difference is just an implementation detail in how they go about ensuring that they lose nothing, with one choice ("crash-only") being harder to program than the other ("struggle on"). The funny thing is that although we've long recognized the idea of recovery in file systems and databases, to the extent that we now expect them to feature recovery, we don't have the same expectation of applications.

Why not?

Apple's iCal, iTunes and Address Book don't have any explicit "save". (This was particularly useful in the case of Address Book because in the early days it crashed all the time.) Using these "save-less" applications convinced me to make a similar effort in my own applications. I don't just mean taking care of things the user has explicitly input (i.e. typed); I mean things such as what files were open, and window locations and sizes too. This becomes a minor feature in itself, even if the program in question doesn't crash. It means I can log out or reboot quicker; it means I can recover from a power cut better; it means I can upgrade to a new version in the middle of working on something.
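One way to get this "save-less" behavior is to snapshot state to disk whenever it changes, and to write the snapshot so that a crash in the middle of a save can't destroy the previous one. The following is a rough sketch under assumptions of my own: the `SessionStore` class and the idea of storing state as a single string are hypothetical, and I have no idea how Apple's applications actually do it. It writes to a temporary sibling file and then renames it over the real one.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: keeps the latest session state (open files, window
// geometry, unsaved text, ...) in one file that is replaced as a whole.
final class SessionStore {
    private final Path file;

    SessionStore(Path file) {
        this.file = file;
    }

    // Write the new snapshot to a temporary sibling, then rename it into place.
    // If we crash during the write, the previous snapshot is still intact.
    void save(String state) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, state.getBytes(StandardCharsets.UTF_8));
        try {
            // Whether an atomic move replaces an existing target is
            // implementation specific; it does on POSIX file systems.
            Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback where atomic moves aren't available: not crash-safe.
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Return the last saved snapshot, or an empty string on first run.
    String restore() throws IOException {
        if (!Files.exists(file)) {
            return "";
        }
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}
```

The application would call `restore()` at startup and `save(...)` after each meaningful change (or on a short timer). This sketch doesn't force the data to disk before renaming, so a careful implementation that has to survive power cuts would also need to sync the temporary file first.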

The more I use programs with that kind of behavior, the more I'm annoyed by programs that make me manually save and restore state. Worse still are those programs where you can't even do the job yourself. Safari and Camino (the two main Mac web browsers) don't crash often, but when they do, they take a bunch of windows with them. And if those windows were open, it was for a reason. That was my input, and the program I gave it to didn't take sufficient care of it, and now it's lost. (I believe, from an Ars Technica review, that the Opera browser does remember window URLs, locations, and scroll positions. But why don't all browsers?)

Another example is the BlackBerry: a reasonable bit of hardware ruined by terrible software. It's lost so many mails I was typing that I only use it as a read-only device these days. I don't trust it, and it's unlikely ever to regain that trust. Even though I'm told that more recent software crashes less, I won't even give it a try unless it starts automatically saving what I'm doing, and automatically restores it after a watchdog-invoked reboot. (Booting in less than 2 minutes would be important, too.)

Trust is a very difficult thing to win back. The best way I've seen for applications to do this is by being able to return to the exact state they were in before they crashed or were quit. The Mac OS X Apple Software Design Guidelines don't mention this, not even under "Reliability", but they should.