2006-01-22

How does Terminator know what processes might die on Linux and Cygwin?

If you enjoyed "How does Terminal.app know what processes might die?", you'll love this. If you didn't, you may as well stop reading right now.

If you remember, I finished my tour of Mac OS options for finding all processes using a particular tty with the following:
"The only bummer is that Linux's sysctl.h doesn't include KERN_PROC_TTY, so I guess we'll have to grub around in /proc or call lsof(1) there."
The easier option for Linux, it seemed, was to use lsof(1). It was pretty slow at around 240ms on my Opteron compared to Mac OS' 7ms on my dual G5, but that seemed just about bearable. 300ms seems to be about the point where users at my level of impatience notice a delay.
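For the curious, calling out to lsof(1) can be sketched in a few lines of Ruby. This is an illustration rather than what Terminator actually ran: the helper names parse_lsof_pids and pids_using are made up for this example, and the only lsof(1) feature it relies on is the -t flag, which prints bare process ids, one per line.

```ruby
require "open3"

# Turn the output of "lsof -t" (one pid per line) into an array of integers.
def parse_lsof_pids(output)
  output.split("\n").map { |line| line.strip() }.reject { |line| line.empty?() }.map { |line| Integer(line, 10) }
end

# Ask lsof(1) which processes have tty_path open.
# Hypothetical helper; assumes lsof(1) is installed and on the PATH.
def pids_using(tty_path)
  output, _status = Open3.capture2("lsof", "-t", tty_path)
  return parse_lsof_pids(output)
end
```

Something like pids_using("/dev/pts/3") would then give the pids to look up. Every call pays for starting a new process and for whatever work lsof(1) does internally, which is where the time goes.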

What I didn't think about, though, is that most machines aren't very lightly loaded Opterons. A Pentium 4 with a lot more processes was regularly taking about 0.5s, which was noticeable. And then complaints started coming in about times of over a second on another Pentium 4.

lsof(1) doesn't scale well.

I knocked up a little Ruby script proof of concept to see how well grubbing around in /proc would work:
#!/usr/bin/ruby -w

if ARGV.length() != 1
  $stderr.puts("usage: lsof.rb <absolute-filename>")
  exit(1)
end

filename = ARGV[0]

def has_file_open(pid, filename)
  Dir["/proc/#{pid}/fd/*"].each() {
    |fd_file|
    begin
      linked_to_file = File.readlink(fd_file)
      if filename == linked_to_file
        return true
      end
    rescue
      # Ignore errors.
    end
  }
  return false
end

pids = []
Dir.chdir("/proc")
Dir["[0-9]*"].each() {
  |pid|
  if File.stat("/proc/#{pid}/fd").readable?()
    if has_file_open(pid, filename)
      pids << pid
    end
  end
}

names = []
pids.sort().uniq().each() {
  |pid|
  # Extract the "(name) " field from /proc/<pid>/stat.
  name = IO.readlines("/proc/#{pid}/stat", " ")[1]
  # Rewrite it as "name(pid)".
  names << name.sub(/^\((.*)\) $/) { |s| "#$1(#{pid})" }
}
puts(names.join(", "))
exit(0)
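One caveat about the last step of the script: splitting /proc/<pid>/stat on spaces assumes the process name contains none. The name field is wrapped in parentheses but may itself contain spaces or parentheses, in which case the script above would keep only the fragment up to the first space. A more defensive sketch takes everything between the first "(" and the last ")". The helpers name_from_stat and describe, and the proc_root parameter, are names invented for this example.

```ruby
# Extract the process name from the contents of /proc/<pid>/stat.
# The name is everything between the first "(" and the last ")",
# so names containing spaces or parentheses survive intact.
def name_from_stat(stat_line)
  open_paren = stat_line.index("(")
  close_paren = stat_line.rindex(")")
  if open_paren.nil?() || close_paren.nil?() || close_paren < open_paren
    return nil
  end
  return stat_line[(open_paren + 1)...close_paren]
end

# Return "name(pid)", or nil if the process has gone away or the
# stat file is unreadable. proc_root is a parameter only so that the
# function can be pointed somewhere other than /proc.
def describe(pid, proc_root = "/proc")
  name = name_from_stat(File.read("#{proc_root}/#{pid}/stat"))
  return name.nil?() ? nil : "#{name}(#{pid})"
rescue SystemCallError
  return nil
end
```

Returning nil rather than raising also covers the case where a process exits between being found and being named, which is always possible when reading /proc.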

This was significantly faster, and had the advantage of working on Cygwin, which doesn't ship with lsof(1) but does have a sufficiently compatible /proc. Even on Cygwin it was only taking about 70ms. On Linux (on the Pentium 4) it was down around 40ms.
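If you want comparable numbers for your own machine, a small timing harness is enough. This is only a sketch of one reasonable way to measure, not a description of how the figures in this post were taken; best_time_ms is a name made up for the example.

```ruby
# Run the block the given number of times and return the fastest
# elapsed time in milliseconds, using the monotonic clock so that
# wall-clock adjustments don't skew the result.
def best_time_ms(runs = 5)
  best = nil
  runs.times() {
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    elapsed = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0
    best = elapsed if best.nil?() || elapsed < best
  }
  return best
end
```

Wrapping the /proc scan in a call such as best_time_ms(5) { Dir["/proc/[0-9]*/fd/*"].length() } gives a rough lower bound. Taking the fastest of several runs filters out one-off scheduling noise, though a slow first run is exactly what an impatient user sees.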

The killer, though, the thing that gave me sufficient impetus to actually make the change, is that lsof(1) can hang if your Linux box has a hung mount. Bad enough that it was taking over a second (on the event dispatch thread!), but that it could sometimes just go away and never come back...

lsof(1) doesn't play nice with network file systems.

A quick rewrite of my Ruby in C++ later, and there's no danger of this part of Terminator hanging over a hung mount. We get our result in exactly the form we want in 20ms (on the Pentium 4 Linux machine). The Opteron is down to around 10ms: the same as the dual G5's Mac OS sysctl(3).

Why did I rewrite the script in C++ rather than just call out? No particularly good reason. I didn't really want to start a new process when the user's probably trying to kill something, for the same reason that shells tend to have kill(1) built in. But really it came about because I initially thought "there's no reason not to do this in Java", and then realized that, actually, file system access is one of Java's worst foibles.

Maybe I'm overly sensitive about that, given how much of my life is taken up with file system performance, but Java also has huge functionality gaps when it comes to the file system. Don't even talk to me about symbolic links! The C++ was easy, less verbose than the equivalent Java, and roughly as verbose as the equivalent Ruby. A good POSIX C++ binding would have made things even better.

Anyway, the users are quiet again, so I can go back in my box.