2007-03-18

Optimizing JNI array access

It's a popular misconception amongst people who don't write Java that JNI is used for performance. The main use of JNI in real life is to get access to native functionality that you can't get in pure Java. Even in such a situation, it's usually preferable to just call out to a separate external binary, because such things are easier to write, easier to debug, easier to deal with portably, can't crash the JVM, and may be useful in their own right.

When you do use JNI, you don't want to waste performance if you can help it, but there doesn't seem to be much written about JNI performance, and what little there is doesn't seem to be well maintained.

Take array access. The JNI book's section on Accessing Arrays describes a number of alternatives for accessing primitive arrays from JNI, and even has a little diagram showing how to choose. The trouble is that it's misleading.

Assuming you can't guarantee that your code won't block, there are two basic choices for primitive array access from JNI. (The blocking caveat rules out GetPrimitiveArrayCritical, which forbids blocking and other JNI calls between the get and the corresponding release.) To further simplify, I'll only talk about accessing a byte[]; the other primitive types have analogous functions with the obvious names.

GetByteArrayRegion or GetByteArrayElements?
The first alternative is the GetByteArrayRegion/SetByteArrayRegion pair. These functions copy from the Java heap to the JNI heap or back. If you want read-only access, you call GetByteArrayRegion. If you want write-only access, you call SetByteArrayRegion. If you want read-write access, you call both. If you only want to operate on a sub-range of the array, you tell the functions, and they only copy that sub-range.
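
To make that concrete, here's a minimal sketch of the copy-in/copy-out style. (The class and method names are invented for the example; the JNI calls themselves are the real ones.)

#include <jni.h>

/* Hypothetical native method: increment each byte in a sub-range of a byte[].
 * Read-write access, so we copy the range in, modify it, and copy it back. */
JNIEXPORT void JNICALL
Java_Example_incrementRange(JNIEnv *env, jobject instance,
                            jbyteArray array, jint offset, jint length)
{
    jbyte buffer[4096];
    if (length > (jint) sizeof(buffer)) {
        return; /* Real code would allocate or throw; kept short for the sketch. */
    }
    (*env)->GetByteArrayRegion(env, array, offset, length, buffer);
    for (jint i = 0; i < length; ++i) {
        ++buffer[i];
    }
    /* Read-only access would stop here; write-only access would skip the Get. */
    (*env)->SetByteArrayRegion(env, array, offset, length, buffer);
}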

The second alternative is the GetByteArrayElements/ReleaseByteArrayElements pair. What's the motivation? According to the JNI book, they "allow the native code to obtain a direct pointer to the elements of primitive arrays. Because the underlying garbage collector may not support pinning, the virtual machine may return a pointer to a copy of the original primitive array." So this is an optimization, right? You'll copy if necessary (but then you were going to do that with the first alternative, so you haven't lost anything) and you might be lucky and get direct access. The only obvious catch is that you can't specify a sub-range. So if you do end up paying for a copy and you only wanted a few elements from a large array, you'll have paid for a bigger copy than necessary.

You don't have to pay for a copy back in ReleaseByteArrayElements, by the way, if you didn't modify the array: you pass a flag telling JNI whether to copy the changes back or just throw them away. Of course, this means that your behavior now differs depending on whether you got the original or a copy: if you got the original, you'll already have changed the array elements before you even call the release function. It's not impossible that this is what you want (because the array is some kind of scratch space whose contents you're happy to corrupt, say), but this is thin ice.
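
Here's the same kind of thing written with the second alternative. Again, the method name is invented, but the calls and the release modes (0, JNI_COMMIT, and JNI_ABORT) are the real API:

#include <jni.h>

/* Hypothetical native method: increment every byte of a byte[] using
 * GetByteArrayElements. Note there's no way to ask for just a sub-range. */
JNIEXPORT void JNICALL
Java_Example_incrementAll(JNIEnv *env, jobject instance, jbyteArray array)
{
    jbyte *elements = (*env)->GetByteArrayElements(env, array, NULL);
    if (elements == NULL) {
        return; /* OutOfMemoryError has already been thrown. */
    }
    jsize length = (*env)->GetArrayLength(env, array);
    for (jsize i = 0; i < length; ++i) {
        ++elements[i];
    }
    /* Mode 0: copy back (if we were given a copy) and free the buffer.
     * JNI_ABORT: free without copying back -- but if we were given the
     * original, our changes have already happened. */
    (*env)->ReleaseByteArrayElements(env, array, elements, 0);
}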

What really happens
The kick in the nuts, though, is that current Sun JVMs never give you a direct pointer. They always copy. I measured this by instrumentation, and then I checked the source. Take a look at DEFINE_GETSCALARARRAYELEMENTS in jni.cpp in the HotSpot source.
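
If you want to see what your own JVM does, the isCopy out-parameter makes the instrumentation trivial; a throwaway sketch like this is all I mean by "instrumentation":

#include <stdio.h>
#include <jni.h>

/* Reports whether GetByteArrayElements pinned the array or copied it. */
static void reportCopyBehavior(JNIEnv *env, jbyteArray array)
{
    jboolean isCopy = JNI_FALSE;
    jbyte *elements = (*env)->GetByteArrayElements(env, array, &isCopy);
    if (elements == NULL) {
        return; /* OutOfMemoryError has already been thrown. */
    }
    fprintf(stderr, "GetByteArrayElements returned a %s\n",
            isCopy == JNI_TRUE ? "copy" : "direct pointer");
    /* We didn't modify anything, so ask JNI to discard rather than copy back. */
    (*env)->ReleaseByteArrayElements(env, array, elements, JNI_ABORT);
}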

I'm not sure why DEFINE_GETSCALARARRAYREGION (the macro implementing GetByteArrayRegion and relatives) doesn't use sizeof(ElementType). I know that calling typeArrayKlass::cast(src->klass())->scale() and multiplying by a non-constant isn't going to be much more expensive than shifting by a value known at compile time, but it's confusing if nothing else. Presumably it's a historical accident. (Sadly, people outside Sun don't get to see the revision history.)

Still, if you want the whole buffer and can afford to allocate enough space on your native stack, GetByteArrayRegion is the current performance champ. It's also the least code.

What about ByteBuffer?
Newer Java releases have java.nio, and that gives you ByteBuffer for explicitly sharing a byte buffer between Java and JNI code. This potentially lets you avoid all the copying, but it's not necessarily convenient if you're trying to fit in with an older API. If you're implementing a new kind of InputStream or OutputStream, for example, this isn't really an option. But if you could use a ByteBuffer, it may be your best choice.
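
If you're curious what the native side of the ByteBuffer approach looks like, it's roughly this (the method name is invented; GetDirectBufferAddress and GetDirectBufferCapacity are the real JNI 1.4 calls):

#include <string.h>
#include <jni.h>

/* Hypothetical native method: zero out a direct ByteBuffer in place.
 * No copying; the native code writes straight into memory the Java side sees. */
JNIEXPORT jint JNICALL
Java_Example_clearBuffer(JNIEnv *env, jobject instance, jobject byteBuffer)
{
    void *address = (*env)->GetDirectBufferAddress(env, byteBuffer);
    jlong capacity = (*env)->GetDirectBufferCapacity(env, byteBuffer);
    if (address == NULL || capacity < 0) {
        return -1; /* Not a direct buffer, or direct buffers aren't supported. */
    }
    memset(address, 0, (size_t) capacity);
    return (jint) capacity;
}

The Java side hands this a buffer from ByteBuffer.allocateDirect and manages position and limit itself.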

What does Terminator do?
Right now, we're using GetByteArrayRegion and SetByteArrayRegion. SetByteArrayRegion means we don't pay for a useless copy down of a buffer we're about to overwrite when doing a read(2), and GetByteArrayRegion requires the least code of the alternatives. (If you read my previous post, you'll know that reading/writing the pty isn't performance-critical; writing especially not, because that's mostly small amounts of human-generated input.)

The reason I'm writing this post, though, is because I was misled by the documentation and by my own measurements. If rule #1 of optimization is "measure", rule #2 is "know what it is you've measured".

What does the JDK do?
"It depends." There's all kinds of variations. One interesting case is that of FileOutputStream; it uses a fixed-size on-stack buffer, but if your request is too large (larger than 8KiB), the JNI side will malloc(3) a buffer, copy into it, and free(3). Given this, you'd be well advised to keep your Java reads and writes less than or equal to 8KiB, if you don't want to accidentally run into this slow path.

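The shape of that pattern is roughly this; it's my sketch rather than the JDK's actual code, but the 8KiB threshold is the part that matters:

#include <stdlib.h>
#include <jni.h>

#define STACK_BUFFER_SIZE 8192

/* Sketch of the FileOutputStream-style pattern: small writes use an on-stack
 * buffer, larger ones pay for malloc(3) and free(3) on top of the copy. */
JNIEXPORT void JNICALL
Java_Example_writeBytes(JNIEnv *env, jobject instance,
                        jbyteArray array, jint offset, jint length)
{
    jbyte stackBuffer[STACK_BUFFER_SIZE];
    jbyte *buffer = stackBuffer;
    if (length > STACK_BUFFER_SIZE) {
        buffer = malloc((size_t) length);
        if (buffer == NULL) {
            return; /* Real code would throw OutOfMemoryError. */
        }
    }
    (*env)->GetByteArrayRegion(env, array, offset, length, buffer);
    /* ... write(2) the 'length' bytes in 'buffer' to the file descriptor ... */
    if (buffer != stackBuffer) {
        free(buffer);
    }
}
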
In some circumstances, this would make me even more tempted than usual to use FileChannel.map to mmap(2) rather than read(2), but as long as there's no way to explicitly munmap(2) – you have to wait for garbage collection – this approach has its own problems (especially if you have Windows users). Sun bug 4724038 has a good summary, mentions the only work-around I can think of, and why Sun doesn't think you should use it.

Next time, as promised, I'll bore you with talk of POSIX ACLs.

2007-03-12

A modern ha'porth o'tar: the curse of integrated graphics

Being a strong proponent of discrete graphics cards (as opposed to integrated graphics), I often hear the claim that they're "just for gamers". Serious work, it's contended, is all about text, so what would you want to go wasting money on 3D graphics hardware for?

It's hard to get people to spend money on most things, but the silly thing about a graphics card (in contrast to something like a 30" LCD, the importance of which to people who deal with a lot of text all day long is equally poorly understood) is that graphics cards are cheap. And yet people still cut corners.

This post contains an illustration of the effect graphics hardware has on text rendering. If you're already convinced of the importance of decent graphics hardware, read on anyway, because I turn up several other stones too.

Someone complained the other day that Terminator was slow. Running the quick made-up-on-the-spot benchmark of "ls -lR" in a reasonably large tree (producing 103200 lines or 7443365 bytes of output), I got the following results on a machine with integrated graphics:

xterm -j 48.907s
xterm -j -s 47.499s
Terminator 41.751s
xterm 41.270s
GNOME Terminal 9.888s
konsole 5.647s

The results were the "real" time measured by time(1), taking one representative run from two or three samples, and the directory tree was in Linux's cache. So although you shouldn't read too much into the numbers, you can see that there's a step change between the XTerm/Terminator results and the GNOME Terminal/konsole results. There are three XTerm results because the -j and -s switches seemed potentially helpful to performance. Other than that, I didn't try to even the playing field at all, and ran the others with their default settings. So Terminator, for example, had infinite scrollback and was logging to disk.

If you'd actually watched me collect these results, you'd have seen XTerm flickering its little heart out to draw everything. GNOME Terminal and Konsole both appeared very leisurely, and then had a visible "snap" where they went from scrolling to just showing the bottom lines of output. Terminator, seemingly like XTerm, tries to draw everything.

What about on a similar machine, but with decent graphics hardware: the mid-range (and falling) 7600GT? Again, this test isn't terribly scientific; this was a different Linux distribution on a different machine with a different processor, but the relative times within this bunch of results are again quite revealing:

Terminator 8.233s
GNOME Terminal 7.736s
xterm -j 4.902s
xterm -j -s 4.856s
xterm 4.501s

This is a machine broadly comparable in power, running a similar test on a smaller tree (75401 lines, or 4422053 bytes), but what a turn-around. Now everyone's pretty fast, but XTerm is fastest. (XTerm's potential optimizations still hurt it, though.) GNOME Terminal's clever "don't draw everything" optimization doesn't work as well here, and its use of anti-aliased text probably costs it something. Terminator does okay, especially considering that it hadn't warmed up. A second "ls -lR" in each of the slowest two got these results:

GNOME Terminal 7.918s
Terminator 7.391s

We see that GNOME Terminal slowed down, perhaps because of scrollback (though maybe just noise), and Terminator sped up.

So the simple question "is Terminator slow?" turns out to have a fairly complicated answer if you look too closely. People with integrated graphics who compare it to XTerm would say "no". People with integrated graphics who compare it to GNOME Terminal would say "yes". But people with discrete graphics would have the opposite answers. And people who actually use Terminator (and have thus had HotSpot optimizing the code all day) will probably have a different feeling to someone who starts Terminator, runs a benchmark, and quits. Performance testing is always hard, but Java makes it especially hard.

Is this a useful benchmark? Maybe yes, maybe no. On the one hand, most of a developer's scrolling output is probably build output, and the terminal emulator probably isn't the gating factor there. But there are plenty of instances where you have a lot of text appear suddenly. Verbose C++ compiler errors, tar(1) output, and the Bash man page all find their way to my screen fairly frequently.

But this isn't really about benchmarking terminal emulators. It's about text rendering performance, and I just happen to be talking about it in the context of a terminal emulator. Timing of text rendering in Evergreen showed me long ago that decent graphics hardware is, as you'd expect, even more important for text editors than it is for terminal emulators.

An interesting result I didn't expect was that XTerm would fare the worst on crappy "old-style" graphics hardware. GNOME Terminal shows that you can optimize for integrated graphics, at the cost of disconcerting apparent sluggishness, sudden snaps, and increased complexity. But how interesting that GNOME Terminal rather than XTerm should be the terminal emulator optimized in this way. Not what I would have guessed.

The other aspect of Terminator's behavior that this explains is that some users think Terminator's great because when they hit interrupt, it responds immediately. (Apple's Terminal is notorious for allowing the program to sail happily on.) Other Terminator users say that Terminator behaves more like Apple's Terminal. The difference? Happy users have discrete graphics, unhappy users have integrated graphics. It's not the response time that's different: it's that integrated graphics users still have a rendering backlog to get through after the interrupt.

As a developer, whether or not you should rewrite your code to work better on crap hardware is up to you. How bad is performance for realistic uses? How much of your audience has decent graphics hardware anyway, and thus won't benefit? (According to Jon Peddie Research, reported by Ars Technica, integrated graphics make up 62.5% of the market, though that's not necessarily 62.5% of the market for your particular product.) Is there something more generally useful you could be spending your time and effort on that would be more worthwhile? How long until Vista means that even integrated graphics doesn't suck? And don't think that integrated graphics is as bad as it gets: what if you have users who're using VNC, or X11? Or what if you're writing an application that would be really useful on something like the OLPC laptop? There will always be resource-constrained hardware. That today's "resource-constrained" would have seemed a pretty hot machine 20 years ago doesn't mean a thing: our standards have improved since then. So, as ever, you have to decide what makes sense for your application and its users.

As a user, you should get a graphics card. I score pretty high on the "software survivalist" scale, but even I don't write all of the applications I use! And even if, say, your terminal emulator copes well with weak hardware, that doesn't mean your text editor does, or your mailer or your web browser or whatever. You should try to get one that has an appropriate amount of memory for the resolution you want to use, or your cheapo card is likely to fall back to scavenging system memory in the same way that integrated graphics does (an important cause of poor performance). You also need to think about drivers. ATI drivers for Linux made enough people sad that I don't know anyone who'd use an ATI card. (Maybe AMD will do something about that.) Intel, at the other end of the spectrum, pay famous X11 old-timers to write open-source drivers for their hardware. But their hardware blows. (Maybe Vista will do something about that.) So in the meantime, you're stuck with Nvidia. Here's a random test (just to show ratios again) running a very low-end Nvidia card (a 6200) using the reverse-engineered open-source "nv" driver compared to the binary blob "nvidia" driver:

"nv"
Terminator 18.557s
xterm 16.566s
GNOME Terminal 4.649s

"nvidia"
Terminator 11.226s
xterm 4.292s
GNOME Terminal 4.125s

So an application like XTerm that relies on hardware support for its operations gains greatly from using the non-Free driver; an application like GNOME Terminal that works around weak hardware sees very little improvement; and an application like Terminator that relies on operations this hardware isn't capable of... well, that still sees an improvement, but it's not as impressive as XTerm's improvement. So bear in mind that which card is suitable for you depends on what you're doing.

[Thanks to James Carter for reminding me of these important considerations. Some of the wording in the section about what you should do "as a user" is his, though he'd probably make a stronger argument in favor of Intel's integrated graphics. Personally, I dream of the day that Intel give us a discrete graphics card with open-source drivers. I'll post the pictures of my Nvidia cards smashed to smithereens, as soon as I've switched. At the moment, though, Intel don't sell anything that can drive my 30" display, so although I like Intel philosophically, I can't use their stuff.]

I realize that I promised this post wouldn't be solely an attempt to persuade you that graphics cards are important, even if all you do is text, but I've run out of time and I'm only half-way through my notes. And since this is a great natural break, you'll have to wait for the rest, in which I'll talk about coreutils, POSIX ACLs, JNI, and HotSpot.