Attack of the Mac Python Zombies!

Nice title, huh?

I choose my titles in the same way I choose what to write about. The title's what I typed into Google when I faced my most recent problem, and I'm writing this post because Google didn't furnish the answer until after I'd already solved the mystery myself.

Okay, so I didn't actually search for "Attack of the Mac Python Zombies". The first part's just a bit of sensationalism to hook those of you who, like me, don't care about (or for) Python and might otherwise miss out.

My problem was the error message "fork: resource temporarily unavailable". I'm pretty sure I haven't seen that since university, a decade ago, when someone accidentally or deliberately (always hard to tell) fork-bombed a shared machine. It was a bummer in those days because even if you were the culprit, you couldn't always fix things: your shell didn't necessarily have a "kill" builtin, and if you tried to fork kill(1), well, "fork: resource temporarily unavailable". These days the shell wars are over, and almost everybody runs bash(1), which does have a kill builtin.

I tell a lie: my problem was that opening a new window in Terminal.app would say "Completed Command" in the title bar, and produce no output at all in the window. I'd never seen this before, and didn't understand right away what was going on. I tried opening a few more, leaving the existing ones open, because I've had trouble with permissions on pseudo-terminals before, but that didn't help. I also thought trying again might help because every few thousand terminal windows, bash(1) crashes on start-up. (I long thought that was a Mac OS problem, but I've since seen it, much less frequently, on Linux too, so now I'm not sure.) Luckily, I already had a couple of shells, and typing "ps aux" in one of them showed me the real problem: "fork: resource temporarily unavailable".

If I'd known at this point what I knew by the end of my investigations, I might have logged in as my alter-ego, "Test Monkey", but I didn't realize that the problem wasn't the overall number of processes: it was that Mac OS has a per-user limit. All the Unixes I've met have one, but I can't remember the last one whose default wasn't "unlimited". Here are Ubuntu 6.06's default limits (as reported by bash's "ulimit -a"), for example:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
max rt priority                 (-r) unlimited
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

And here are Mac OS 10.4.6's:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) 6144
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 100
virtual memory          (kbytes, -v) unlimited

At this point, if you have anything to do with Mac OS machines, it's worth committing to memory that not only does Mac OS have a low per-user process limit (100 processes per user), it also has a low data seg size (6144 KiB), a low number of open file descriptors per process (256), and a small pipe size (512 B). [I'm not sure the "data seg size" matters in practice, because that, as I understand it, is just the limit on the portion of the heap allocated with sbrk(2), and doesn't count heap allocated with mmap(2). Google doesn't seem to know why Apple chose this 6 MiB limit.] Jonathon Arnold points out that Mac OS also has a low default limit on the amount of shared memory (4 MiB).
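If you'd rather poke at these limits programmatically than squint at ulimit output, Python's standard resource module exposes them. A minimal sketch (the RLIMIT_* names are the standard POSIX-ish ones, though exactly which limits a given platform defines varies):

import resource

# Each limit is reported as a (soft, hard) pair;
# resource.RLIM_INFINITY means "unlimited".
for name, rlimit in [("max user processes", resource.RLIMIT_NPROC),
                     ("open files", resource.RLIMIT_NOFILE),
                     ("data seg size", resource.RLIMIT_DATA)]:
    soft, hard = resource.getrlimit(rlimit)
    print("%s: soft=%s hard=%s" % (name, soft, hard))

A process can raise its own soft limits as far as the hard limit with resource.setrlimit, which is what bash's ulimit builtin is doing under the covers.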

So: Terminal was broken because we'd hit our 100-process limit. Sadly, I didn't realize this until I'd already killed a few things to give myself room to maneuver. I was now at a point where I could see plainly with ps(1) that my problem was a couple of hundred zombie processes ("Z" in the "STAT" column) named "(python)". So something was running python(1) but not waiting for its children. And that something was still running. And, judging by the fact that my girlfriend was close to hitting her 100-process limit too, it was something we both run.
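If you've never met a zombie, they're easy enough to breed at home. Here's a minimal sketch of the mistake, in the same language as my undead horde:

import os, time

pid = os.fork()
if pid == 0:
    # Child: exit immediately. Until someone wait(2)s for it, the
    # kernel keeps its exit status around, and ps(1) shows it as "Z".
    os._exit(0)
else:
    # Parent: the bug lives here. We never call os.wait(), so for the
    # next minute "ps -axo pid,ppid,stat,command" shows one zombie.
    time.sleep(60)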

This, I thought, was handy: it meant I could rule out all the software-development stuff that only I run. As it turned out, that conclusion was completely false.

I methodically quit all the running GUI applications, checking with ps(1) each time to see whether the zombies had been reaped. (Kill the process that the kernel thinks should be waiting, and its orphaned zombies are reparented to init(8), which waits for them, putting their souls to rest. So if quitting an application had made my zombies disappear, I'd have known that application was at fault.) After eliminating all the GUI applications as possibilities, I started killing the likes of AppleSpell and ATSServer. I got right down to having killed everything but loginwindow, then connected from the Ubuntu box and killed loginwindow too.
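For reference, what a well-mannered parent does instead is reap its own children. A sketch of the usual non-blocking loop, the kind of thing a long-running process would run whenever SIGCHLD arrives:

import os

def reap_children():
    # Collect the exit status of every dead child without blocking,
    # so none of them linger as zombies.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError:
            return  # No children at all.
        if pid == 0:
            return  # Children remain, but none of them have died yet.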

Still my zombies remained.

I was a bit confused at this point, but got my girlfriend to log out, just to see whether the same was true for her. It was: logged out, she still had nearly 100 "(python)" zombies.

The good news at this point was that there was little left running. I'd eliminated most of the possibilities, and one of the few processes I didn't recognize was "/System/Library/PrivateFrameworks/DedicatedNetworkBuilds.framework/Versions/A/Resources/bfobserver". I killed that, all the zombies on the system vanished, and various things that had got stuck for want of the ability to fork came back to life.

Searching Google for "bfobserver" (which I had read as "b-fob-server", but now realize is "b-f-observer"; damn those C programmers!), "python", and "zombies" returned one match: a Google Groups thread, "python at login on macintels", describing exactly the same problem.

Basically, the new Distributed Builds support in Xcode 2.3 runs a Python script, "/System/Library/PrivateFrameworks/DedicatedNetworkBuilds.framework/Resources/sctwistd", and forgets to wait for it. It does this without you having activated the feature in any way, or even having run Xcode, and it seems to do it every time you log in, including fast-user-switching logins. That took us just under 20 days to reach our 100-process limits, so I expect a rash of problems soon for developers who don't reboot unless forced to, but who share a machine. If I'd installed the latest QuickTime update (which probably just gives me more DRM crap I don't want; and for that it wants me to reboot?), the reboot would have put the problem off for another 20 days.
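For what it's worth, there's a standard way to launch a child you genuinely never intend to wait for: the double-fork trick, which hands the grandchild to init(8) so that init does the waiting. A sketch, assuming you really don't care about the exit status (the "sleep 60" is just a stand-in, not Apple's actual invocation):

import os

def spawn_detached(argv):
    pid = os.fork()
    if pid == 0:
        # Intermediate child: fork the real program, then exit at
        # once, so the grandchild is orphaned and adopted by init(8).
        if os.fork() == 0:
            os.execvp(argv[0], argv)  # Raises OSError on failure.
        os._exit(0)
    # Parent: reap the short-lived intermediate child. No zombies.
    os.waitpid(pid, 0)

spawn_detached(["sleep", "60"])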

I'm annoyed by Terminal.app's poor diagnostics. I'm annoyed by Activity Monitor's failure to indicate any problem (it doesn't show zombies, even when you've got nearly 200 of them). I'm annoyed by DedicatedNetworkBuilds' bug, and I'm extra annoyed that the Xcode installer means I have it running without having opted in, and without any likelihood that I'll ever be in a position to make use of it. It's probably jolly useful inside Apple, but when am I going to have enough Apple machines to be able to use it? I don't have enough to use distcc(1), which they still recommend for smaller builds. Why do I have to install all the Xcode crap just to get the latest GCC and weird Apple binutils replacements anyway? Why aren't Apple's packages more transparent? "Click OK to run arbitrary code as root."

On the bright side, I'm thankful I'm running some variant of Unix. I told my girlfriend this is what's so great about Unix: you can do better than just throw your hands up in despair and reboot/reinstall/get a new computer.

She was so impressed.