2008-12-20

Will stdio outlive us?

I'm reading "Advanced Programming in the Unix Environment" at the moment. I'll post a review at some point, when I've finished. Right now, I've just reached chapter 5, "Standard I/O Library", and it's reminding me how much I hate stdio.

I realize it's somewhat unfair to criticize an API that's as old as I am. I think much of Unix has the advantage that it was only ever meant to provide solid primitives, and for the most part that stuff's stood the test of time well. Higher-level stuff (by the standards of the time) like C and stdio has fared significantly less well because our expectations of "high level" have changed so much. It's the continued ubiquity of this nasty turdlet that frustrates me; the fact that not only are people still using it, but that I'm still seeing people misuse it in the same ways they were misusing it 20 years ago, when I first encountered it.

getchar
This was actually where my hatred of stdio began, despite the fact that it's not stdio's fault that so much buggy getchar(3)-calling code got written. The interface is pretty sound, in a Unixy way. The mistake people make with getchar(3) is more a reflection of the environment that surrounds stdio: C.

By various accidents, I've spent most of my life programming on platforms whose char is unsigned. If you've led a more normal life, not involving ARM or PowerPC, you may not even have been aware that C's "char" type can be signed or unsigned, at the implementation's whim.

I had to learn about signed char versus unsigned char twenty years ago, learning C on an ARM-based home computer. The trouble was that the world (especially Herb "gobshite" Schildt's world) was full of code like this:

char ch;
while ((ch = getchar()) != EOF) { // BROKEN!
    // ...
}

This works fine on signed-char platforms, but not on unsigned-char platforms: there, the -1 that getchar(3) returns for EOF is stored in ch as 255, and 255 != -1, so the loop never terminates. The solution is to use int instead of char, at least until you've determined that you're not dealing with EOF. (There were other hacks, but they were bad ideas, and they're the reason why, to this day, I know what character is represented by 0xff in ISO-8859-1, even though I've never seen anyone or anything deliberately use that character.)
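
For what it's worth, here's a minimal sketch of the int-based idiom. The count_bytes name is made up for illustration, and it takes a FILE* and calls getc(3) rather than assuming stdin:

```c
#include <stdio.h>

/* Hypothetical helper: counts the bytes remaining on fp. */
long count_bytes(FILE *fp)
{
    int ch;      /* int, not char: it must hold every byte value plus EOF */
    long n = 0;

    while ((ch = getc(fp)) != EOF) {
        n++;     /* only here is it safe to narrow ch to a char */
    }
    return n;
}
```

Because ch is an int, a 0xff byte in the input arrives as 255 and can't be confused with EOF, whichever way the platform's char swings.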

Could stdio have fixed this? Not convincingly; taking an int* or returning a struct are the only ideas that come to mind. C might usefully have renamed "char" to "byte" and only offered an unsigned variant. For its part, stdio could usefully have deemphasized character I/O. (More on that later.)

getchar versus getc versus fgetc
A trivial thing, but even as a kid the redundancy of getchar(3)/getc(3)/fgetc(3) and their putting counterparts bothered me. I was born a minimalist. The maybe-macro versus definitely-not-macro distinction seemed particularly unconvincing and ugly. Maybe it had made more sense to offer a maybe-macro in the 1970s, I remember thinking.

But the implicit-stdin/stdout variants bothered me too. One of the things I really liked about Unix was the way it treated the console like just another kind of file (and the convention of using the filename "-" to mean the console; I remember being surprised that fopen(3) didn't return the appropriate FILE* when given that special filename, and disappointed that all programs that wanted to support the convention had to do so manually).
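
The manual support in question tends to look something like the sketch below. The open_input name is hypothetical, and this version only handles reading:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: treat "-" as standard input, anything else as a
   path to open for reading. Returns NULL if the file can't be opened. */
FILE *open_input(const char *name)
{
    if (strcmp(name, "-") == 0) {
        return stdin;
    }
    return fopen(name, "r");
}
```

The caller then has to remember not to fclose(3) the result blindly, since it might be stdin, which is one more reason I'd rather the library had done this for me.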

Encouraging people to write code that only works for stdin/stdout seemed like a mistake, even in the years before I met more sophisticated stream abstractions.

getchar versus EOF versus error
Before we leave getchar(3) and friends, it's worth remembering that it returns "EOF" both when you're at the natural end of the file and when an error occurs. This isn't necessarily a problem if you're careful with your idiom, looping until you get EOF and then checking ferror(3), but people forget that second part all the time. Or they make a mess out of trying to write something fancier.
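
A minimal sketch of the careful idiom, assuming a made-up copy_stream helper that reports failure with -1:

```c
#include <stdio.h>

/* Hypothetical helper: copies in to out one character at a time.
   Returns 0 on success, -1 on a read or write error. */
int copy_stream(FILE *in, FILE *out)
{
    int ch;

    while ((ch = getc(in)) != EOF) {
        if (putc(ch, out) == EOF) {
            return -1;              /* write error */
        }
    }
    /* EOF alone is ambiguous, so ask which kind it was. */
    return ferror(in) ? -1 : 0;
}
```

The final ferror(3) call is the part that gets forgotten; without it, a read error is indistinguishable from a short file.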

I've come to wonder if the seductive simplicity of offering a per-character interface isn't also a mistake. There are relatively few programs whose natural unit of input is the character (rather than the line or block) and in retrospect it's starting to look as if we might be better off if the most notable of those, lexical analyzers for programming languages, weren't character-based. (Treating whitespace as insignificant doesn't seem to suit humans particularly well, and definitely leads to whitespace wars.)

And what is a character, anyway? Sadly, there's precious little support for character encodings in stdio. (Though that's not a criticism of the decisions made in the 1970s so much as a criticism of the continued use of stdio.)

fread and fwrite
Like its character-based friends, fread(3) also fails to distinguish between natural ends of files and errors, with the same consequences for correctness. Relatedly, now that we're dealing with more than a single byte at a time, have you ever seen a caller of fread(3)/fwrite(3) check ferror(3) and then check errno for EINTR, assuming that having read/written zero "objects" means they've read/written zero bytes? Is it possible to retry an fwrite(3) at the application level? No. At least, not if you're writing objects larger than bytes. (So if you really must use this awful API, be sure to write n "objects" each 1 byte long, rather than 1 object n bytes long.)
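
To make that parenthetical concrete, here's a sketch with a hypothetical write_bytes wrapper. All it does is fix the argument order so that the return value counts bytes:

```c
#include <stdio.h>

/* Hypothetical helper: writes n bytes from buf to fp and returns the
   number of bytes fwrite(3) reports as written. */
size_t write_bytes(const void *buf, size_t n, FILE *fp)
{
    /* size = 1, count = n: a short return says how many bytes went out.
       With size = n, count = 1 the only possible answers are 0 and 1. */
    return fwrite(buf, 1, n, fp);
}
```

A result smaller than n means something went wrong partway through, and the caller at least knows how far it got before deciding what to do about it.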

ungetc
It was a few years before I was sophisticated enough to be concerned that buffering wasn't a separate concern, but the ungetc(3) function bothered me from the start. The character I push back doesn't have to be the character I read? I can't push back EOF, but I can push back a character after reading EOF? I might be able to push back more than one character, but not portably, and I can't even query the pushback depth?

FILE* fp
And did it ever bother you that the "FILE* fp" parameter always comes last (which would have been annoying in and of itself)... except where it comes first? And that you just have to remember which functions are which?

fgets
Why do I have to tell fgets(3) how big the line is, before it gives it to me? How would I know? Sadly, fixed limits are a grand old Unix tradition (see also: getcwd(3)). They didn't even really give them up the second time around: if Plan 9 hadn't been so riddled with fixed-size buffers, I wouldn't have been forced to learn Perl so I could write scripts that could cope with arbitrary-sized data. At least Plan 9 had mechanisms for doing the right thing, even if they weren't widely used; compare Brdline(3) and Brdstr(3) and ask yourself why the former even exists (and why the worse function got the better name).

Even if you're happy with fixed-size buffers, the way fgets(3) leaves the trailing newline in the buffer must annoy you. Especially given the asymmetry with fputs(3), and the lack of anything like Perl's "chomp". Close your eyes and tell me that just thinking about fgets(3) doesn't bring to mind the obligatory next line that overwrites the '\n' with a '\0'! And didn't it break your C-programming heart that you had to pay for the strlen(3)? I wince even now.
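
For anyone who hasn't had the pleasure, a minimal sketch of the ritual follows. The read_line name is made up, and I've used strcspn(3) rather than strlen(3), which avoids the empty-string special case but still scans the whole line:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line into buf and strips the trailing
   newline, if there is one. Returns buf, or NULL at end of file or on
   error (check ferror(3) to tell which). */
char *read_line(char *buf, int size, FILE *fp)
{
    if (fgets(buf, size, fp) == NULL) {
        return NULL;
    }
    buf[strcspn(buf, "\n")] = '\0';  /* no-op if no newline was stored */
    return buf;
}
```

Note that this does nothing about the fixed-size buffer: a line longer than size - 1 characters still comes back in pieces, and once the newline is gone the caller can't tell a piece from a whole line.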

This is what happens, Larry, when you don't offer improved APIs (let alone deprecate bad ones).

I know we don't use gets(3) any more, but it's a real textbook example of convenience triumphing over quality. "Let's get rid of the FILE* parameter, because we encourage that to be stdin anyway, and while we're at it, let's get rid of the buffer length — even though we've no idea how large the buffer is, or how long the line is that we're about to copy into said buffer." Surely even in the 1970s, that sounded like a bug rather than a feature.

But hey, at least gets(3) chomped the trailing newline for you.

Why can't most people even name a stdio competitor?
The final sentence of APUE's chapter 5 – "Be aware of the buffering that takes place with this library, as this is the area that generates the most problems and confusion" – is almost funny. About the only thing the "FILE*" part of stdio got convincingly right was that the arguments to fopen(3) were simpler and more memorable than open(2), and not having an equivalent of creat(2) is less confusing than having the pair. I also think fopen(3) specifically helped distance stdio from Unix modes and flags and file descriptors, which was convenient for the small non-Unix machines of the 1980s. It meant a whole generation of us grew up with stdio but without the originally underlying layer. That and stdio's ubiquity across non-Unix platforms via the standard C library left stdio unassailable.

The immortal printf(3) family probably helped stdio too, but I'm not complaining about them. There's a (good) reason why they get emulated in every language's library, sooner or later. Some languages' versions even address most of the problems of the originals, while retaining the advantages.