2007-03-18

Optimizing JNI array access

It's a popular misconception amongst people who don't write Java that JNI is used for performance. The main use of JNI in real life is to get access to native functionality that you can't get in pure Java. Even in such a situation, it's usually preferable to just call out to a separate external binary, because such things are easier to write, easier to debug, easier to deal with portably, can't crash the JVM, and may be useful in their own right.

When you do use JNI, you don't want to waste performance if you can help it, but there doesn't seem to be much written about JNI performance, and what little there is doesn't seem to be well maintained.

Take array access. The JNI book's section on Accessing Arrays describes a number of alternatives for accessing primitive arrays from JNI, and even has a little diagram showing how to choose. The trouble is that it's misleading.

Assuming you can't guarantee that your code won't block, there are two basic choices for primitive array access in Java. To further simplify, I'll only talk about accessing a byte[]; the other primitive types have analogous functions with the obvious names.

GetByteArrayRegion or GetByteArrayElements?
The first alternative is the GetByteArrayRegion/SetByteArrayRegion pair. These functions copy from the Java heap to the JNI heap or back. If you want read-only access, you call GetByteArrayRegion. If you want write-only access, you call SetByteArrayRegion. If you want read-write access, you call both. If you only want to operate on a sub-range of the array, you pass a start index and length, and the functions only copy that sub-range.
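As a sketch of the read-write case (the Java-side class, method, and parameter names here are invented for illustration), a native method that XORs a sub-range of a byte[] in place might look like this:

```c
#include <jni.h>

/* Hypothetical native method: XOR 'len' bytes of 'array', starting at
 * 'offset', with 'key'. */
JNIEXPORT void JNICALL
Java_Example_xorRange(JNIEnv *env, jclass cls, jbyteArray array,
                      jint offset, jint len, jbyte key) {
    jbyte buffer[4096];
    if (len < 0 || len > (jint) sizeof(buffer)) {
        jclass iae = (*env)->FindClass(env,
            "java/lang/IllegalArgumentException");
        (*env)->ThrowNew(env, iae, "bad length");
        return;
    }
    /* Copy just the requested sub-range down from the Java heap... */
    (*env)->GetByteArrayRegion(env, array, offset, len, buffer);
    if ((*env)->ExceptionCheck(env)) {
        return;  /* ArrayIndexOutOfBoundsException pending */
    }
    for (jint i = 0; i < len; ++i) {
        buffer[i] ^= key;
    }
    /* ...and copy the modified sub-range back up. */
    (*env)->SetByteArrayRegion(env, array, offset, len, buffer);
}
```

Note that both copies cover only the sub-range you asked for, and bad offset/length combinations cost you an ArrayIndexOutOfBoundsException rather than silent corruption.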

The second alternative is the GetByteArrayElements/ReleaseByteArrayElements pair. What's the motivation? According to the JNI book, they "allow the native code to obtain a direct pointer to the elements of primitive arrays. Because the underlying garbage collector may not support pinning, the virtual machine may return a pointer to a copy of the original primitive array." So this is an optimization, right? You'll copy if necessary (but then you were going to do that with the first alternative, so you haven't lost anything) and you might be lucky and get direct access. The only obvious catch is that you can't specify a sub-range. So if you do end up paying for a copy and you only wanted a few elements from a large array, you'd have paid for a bigger copy than necessary.

You don't have to pay for a copy back in ReleaseByteArrayElements, by the way, if you didn't modify the array: you pass a flag telling JNI whether to copy or just throw away the changes. Of course, this means that your behavior now differs depending on whether you got the original or a copy: if you got the original, you'll already have changed array elements before you even call the release function. It's not impossible that this is what you want (because the array is some kind of scratch space that you're happy to corrupt the contents of, say), but this is thin ice.
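For comparison, here's a sketch of read-only access via the second alternative (again, the class and method names are made up). JNI_ABORT is the flag that tells ReleaseByteArrayElements to free any copy without writing it back:

```c
#include <jni.h>

/* Hypothetical native method: sum the bytes of 'array'. Access is
 * read-only, so we release with JNI_ABORT to skip the copy back. */
JNIEXPORT jint JNICALL
Java_Example_sumBytes(JNIEnv *env, jclass cls, jbyteArray array) {
    jboolean is_copy;
    jbyte *elements = (*env)->GetByteArrayElements(env, array, &is_copy);
    if (elements == NULL) {
        return 0;  /* OutOfMemoryError pending */
    }
    jsize length = (*env)->GetArrayLength(env, array);
    jint sum = 0;
    for (jsize i = 0; i < length; ++i) {
        sum += elements[i];
    }
    /* JNI_ABORT: throw away any copy without writing it back. Pass 0
     * to copy back and free, or JNI_COMMIT to copy back and keep the
     * buffer for further use. */
    (*env)->ReleaseByteArrayElements(env, array, elements, JNI_ABORT);
    return sum;
}
```

The is_copy out-parameter tells you which case you're in, but as described above, relying on either answer is thin ice.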

What really happens
The kick in the nuts, though, is that current Sun JVMs never give you a direct pointer. They always copy. I measured this by instrumentation, and then I checked the source. Take a look at DEFINE_GETSCALARARRAYELEMENTS in jni.cpp in the HotSpot source.

I'm not sure why DEFINE_GETSCALARARRAYREGION (the macro implementing GetByteArrayRegion and relatives) uses typeArrayKlass::cast(src->klass())->scale() rather than sizeof(ElementType). I know multiplication by a non-constant isn't going to be much more expensive than a shift by a value known at compile time, but it's confusing if nothing else. Presumably it's a historical accident. (Sadly, people outside Sun don't get to see the revision history.)

Still, if you want the whole buffer and can afford to allocate enough space on your native stack, GetByteArrayRegion is the current performance champ. It's also the least code.

What about ByteBuffer?
Newer Java releases have java.nio, and that gives you ByteBuffer for explicitly sharing a byte buffer between Java and JNI code. This potentially lets you avoid all the copying, but it's not necessarily convenient if you're trying to fit in with an older API. If you're implementing a new kind of InputStream or OutputStream, for example, this isn't really an option. But if you could use a ByteBuffer, it may be your best choice.
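The JNI side of this is simple, assuming the Java caller passes a buffer allocated with ByteBuffer.allocateDirect (GetDirectBufferAddress returns NULL for a non-direct buffer). A sketch, with made-up names as before:

```c
#include <jni.h>
#include <string.h>

/* Hypothetical native method: fill a direct ByteBuffer with 'value',
 * with no copying between the Java and JNI sides. Requires JNI 1.4. */
JNIEXPORT void JNICALL
Java_Example_fill(JNIEnv *env, jclass cls, jobject buffer, jbyte value) {
    void *address = (*env)->GetDirectBufferAddress(env, buffer);
    jlong capacity = (*env)->GetDirectBufferCapacity(env, buffer);
    if (address == NULL || capacity < 0) {
        return;  /* not a direct buffer */
    }
    memset(address, value, (size_t) capacity);
}
```

Writes through the returned pointer are visible to Java code reading the same ByteBuffer, which is exactly the sharing that the array functions above can't promise you.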

What does Terminator do?
Right now, we're using GetByteArrayRegion and SetByteArrayRegion. SetByteArrayRegion means we don't pay for a useless copy down of a buffer we're about to overwrite when doing a read(2), and GetByteArrayRegion requires the least code of the alternatives. (If you read my previous post, you'll know that reading/writing the pty isn't performance-critical; writing especially not, because that's mostly small amounts of human-generated input.)

The reason I'm writing this post, though, is because I was misled by the documentation and by my own measurements. If rule #1 of optimization is "measure", rule #2 is "know what it is you've measured".

What does the JDK do?
"It depends." There are all kinds of variations. One interesting case is that of FileOutputStream; it uses a fixed-size on-stack buffer, but if your request is too large (larger than 8KiB), the JNI side will malloc(3) a buffer, copy into it, and free(3) it. Given this, you'd be well advised to keep your Java reads and writes less than or equal to 8KiB, if you don't want to accidentally run into this slow path.

In some circumstances, this would make me even more tempted than usual to use FileChannel.map to mmap(2) rather than read(2), but as long as there's no way to explicitly munmap(2) – you have to wait for garbage collection – this approach has its own problems (especially if you have Windows users). Sun bug 4724038 has a good summary, mentions the only work-around I can think of, and why Sun doesn't think you should use it.

Next time, as promised, I'll bore you with talk of POSIX ACLs.