2008-03-09

Generating JVM bytecode

If you've been wondering where the useful code snippets and hints have been hiding the last few months, the good news is they're back in this post. The bad news is that I'll be talking mostly about objectweb.org's ASM all-purpose Java bytecode manipulation and analysis framework, and don't have any significant experience of the alternatives. When I chose ASM, I'd have loved to read a good comparison of the choices, but I didn't find one. And I'm sad to report that I won't be writing one, at least not right now. Maybe if someone points me to a particularly tempting alternative?

objectweb.org's ASM
I chose ASM because it prides itself on being small and fast, and because it's still frequently maintained. I don't honestly know enough about the competition to know what kind of state they're in, and whether the ASM team's comparative benchmarks are fair and representative. ASM is fast, but I don't know that BCEL (the best-known alternative) would be significantly slower. The documentation for both is pretty weak, but ASM's seems better, and the example code for ASM seems more direct and less verbose.

That said, small does not imply simple, and ASM is a little hairy. The authors went a bit pattern-mad, and it's very "designed" in ways that often seem to be to the detriment of convenience and simplicity. (If you think java.io is awkward, you won't like ASM: it's like java.io, only more so.) In addition, the "convenience" layer in the "commons" package isn't high-level enough that you won't need a good understanding of Java bytecode, but it is far enough from the interface presented by the basic "asm" package (which is a pretty direct mapping), and badly-enough documented that, to be honest, it seems more of a hindrance than a help. I've given up fighting it and am starting to use the underlying functionality instead, even to the extent of reverting some of my existing "commons"-using code. (One problem in particular is that, as in a C++ program that uses stdio and iostreams, you need to be careful that each layer knows what's going on. Now imagine that iostreams was badly documented and seemingly incomplete such that you repeatedly found yourself falling back to stdio. That's the ASM situation. It's possibly nothing that good documentation couldn't fix, but that doesn't help us in the here and now.)

Stick to the basic "asm" package, though, and things don't seem too bad. I do have this sneaking feeling there's an even skinnier guy struggling to get out, though. And I for one would like to make his acquaintance. If you see him, let me know.
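
For a sense of how little a pure generation layer strictly needs, here's a sketch that uses no library at all: it writes a class file by hand with java.io.DataOutputStream, following the ClassFile layout in the JVMS, then loads the result and calls it. To be clear, this is not ASM's API. The names TinyClassGen, Answer, and get are made up for illustration, and the class-file version (49, i.e. Java 5) is an assumption chosen so that no StackMapTable is needed.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hand-rolled sketch. Emits the equivalent of:
//   public class Answer { public static int get() { return 42; } }
final class TinyClassGen {
    static byte[] generate() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(0xCAFEBABE);       // magic
            out.writeShort(0);              // minor_version
            out.writeShort(49);             // major_version (assumed: Java 5)
            out.writeShort(8);              // constant_pool_count = entries + 1
            utf8(out, "Answer");            // #1
            classRef(out, 1);               // #2 this class
            utf8(out, "java/lang/Object");  // #3
            classRef(out, 3);               // #4 superclass
            utf8(out, "get");               // #5 method name
            utf8(out, "()I");               // #6 method descriptor
            utf8(out, "Code");              // #7 attribute name
            out.writeShort(0x0021);         // access_flags: ACC_PUBLIC | ACC_SUPER
            out.writeShort(2);              // this_class
            out.writeShort(4);              // super_class
            out.writeShort(0);              // interfaces_count
            out.writeShort(0);              // fields_count
            out.writeShort(1);              // methods_count
            out.writeShort(0x0009);         // method flags: ACC_PUBLIC | ACC_STATIC
            out.writeShort(5);              // name_index
            out.writeShort(6);              // descriptor_index
            out.writeShort(1);              // attributes_count (just Code)
            byte[] code = { 0x10, 42, (byte) 0xAC }; // bipush 42; ireturn
            out.writeShort(7);              // attribute_name_index -> "Code"
            out.writeInt(12 + code.length); // attribute_length
            out.writeShort(1);              // max_stack
            out.writeShort(0);              // max_locals
            out.writeInt(code.length);      // code_length
            out.write(code);
            out.writeShort(0);              // exception_table_length
            out.writeShort(0);              // attributes_count of the Code attribute
            out.writeShort(0);              // attributes_count of the class
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    private static void utf8(DataOutputStream out, String s) throws IOException {
        out.writeByte(1);   // CONSTANT_Utf8
        out.writeUTF(s);    // u2 length + modified UTF-8, as the class file expects
    }

    private static void classRef(DataOutputStream out, int nameIndex) throws IOException {
        out.writeByte(7);   // CONSTANT_Class
        out.writeShort(nameIndex);
    }

    // Defines the bytes in a throwaway class loader and calls Answer.get() reflectively.
    static int run() {
        final byte[] b = generate();
        ClassLoader loader = new ClassLoader(TinyClassGen.class.getClassLoader()) {
            @Override protected Class<?> findClass(String name) {
                return defineClass(name, b, 0, b.length);
            }
        };
        try {
            return (Integer) loader.loadClass("Answer").getMethod("get").invoke(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 42
    }
}
```

Anything beyond straight-line code (branches, fields, method calls) means more constant-pool bookkeeping and offset arithmetic, which is the part a library like ASM takes off your hands.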

Documentation
You'll want the ASM3 API documentation and it's probably worth looking at the asm-guide.pdf overview, though it's incomplete and, if you're interested in generating bytecode, will seem too focused on processing existing bytecode (ASM supports both, and you'll notice that they describe it as a "manipulation and analysis framework" rather than the skinny "generation framework" I really wanted).

More importantly, you'll need to have read The Java Virtual Machine Specification (2e) by Lindholm and Yellin, which is also available on dead tree, but is equally outdated there. The web version would be greatly improved by a table of contents with direct links to the individual pages for opcodes starting with each letter of the alphabet, and even more so by having the material from the JVMS maintenance page worked into the main text. (Those PDFs suggest Sun has the source to the book, so it's a real shame they don't re-issue the HTML version or an all-in-one PDF version.)

The only other book on JVM bytecode I've read was Joshua Engel's "Programming for the Java Virtual Machine" from the late 1990s. My only retained memory was that it was full of mistakes, and having re-read it recently, I can confirm that to be a reasonable summary of the book. You can safely ignore it, unless you want to see the most amusing typo in any book covering compilers. You can be sure I'll be building a "poophole optimizer" in every compiler I write. (That's the funniest error, but it's also the most trivial. Most of the code examples contain at least one error, and there are numerous statements that make you wonder to what extent this "acknowledged expert in the Java virtual machine" knew what he was talking about. He certainly didn't know how to pick technical reviewers, that's for sure.)

You know what I'd love? Someone like Sun's John Rose to go over the JVMS "Compiling for the Java Virtual Machine" chapter and tell us what kind of code we should be generating for the modern HotSpot JVM. Which idioms and bytecode-level optimizations are good, and which idioms we should avoid. Especially balanced against issues like code size, verifiability, and future-proofing.

Verification
The verifier in Sun's JVM may be great for keeping us (and our programs) out of mischief, but it totally sucks in terms of the quality of its diagnostics. Have a look at OpenJDK's "check_code.c" for the details, but suffice it to say that the errors are lacking in clarity, lacking in detail, lacking in context, and don't even manage to use the usual terminology associated with the JVM and its bytecode. Some messages don't even manage to be valid English sentences. "unitialized" indeed.

Normally this isn't a problem because javac(1) is generating the bytecode. There have been cases where javac(1) has generated unverifiable bytecode, but I can only remember it causing me trouble personally in one instance during the last decade-and-a-bit. If you're writing your own bytecode-generating programs, though, you're a lot more likely to see VerifyError.
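
To see what that looks like, here's a minimal sketch that hand-writes two tiny classes (plain java.io, no ASM) and forces the JVM to link them. One method body is "bipush 42; ireturn"; the other is a bare "ireturn", which tries to return an int from an empty operand stack and so should be rejected. VerifyDemo and the class names Good and Broken are hypothetical, the class-file version (49) is an assumption, and the wording of the error depends on which JVM you run it on.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class VerifyDemo {
    // Hand-written class file for:
    //   public class <name> { public static int get() { <code> } }
    static byte[] classFile(String name, byte[] code) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(0xCAFEBABE);                           // magic
            out.writeShort(0);                                  // minor_version
            out.writeShort(49);                                 // major_version (assumed: Java 5)
            out.writeShort(8);                                  // constant_pool_count
            out.writeByte(1); out.writeUTF(name);               // #1 Utf8
            out.writeByte(7); out.writeShort(1);                // #2 Class -> #1
            out.writeByte(1); out.writeUTF("java/lang/Object"); // #3 Utf8
            out.writeByte(7); out.writeShort(3);                // #4 Class -> #3
            out.writeByte(1); out.writeUTF("get");              // #5 Utf8
            out.writeByte(1); out.writeUTF("()I");              // #6 Utf8
            out.writeByte(1); out.writeUTF("Code");             // #7 Utf8
            // access_flags, this_class, super_class, then interface/field/method counts
            for (int s : new int[] { 0x0021, 2, 4, 0, 0, 1 }) out.writeShort(s);
            // method: ACC_PUBLIC|ACC_STATIC, name #5, descriptor #6, one attribute named #7
            for (int s : new int[] { 0x0009, 5, 6, 1, 7 }) out.writeShort(s);
            out.writeInt(12 + code.length);                     // Code attribute_length
            out.writeShort(1);                                  // max_stack
            out.writeShort(0);                                  // max_locals
            out.writeInt(code.length);                          // code_length
            out.write(code);
            out.writeShort(0);                                  // exception_table_length
            out.writeShort(0);                                  // Code's attributes_count
            out.writeShort(0);                                  // class attributes_count
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns null if the class links (and therefore verifies), else the VerifyError.
    static VerifyError check(final String name, byte[] code) {
        final byte[] b = classFile(name, code);
        ClassLoader loader = new ClassLoader(VerifyDemo.class.getClassLoader()) {
            @Override protected Class<?> findClass(String n) {
                return defineClass(n, b, 0, b.length);
            }
        };
        try {
            // initialize=true forces linking, which is when verification happens
            Class.forName(name, true, loader);
            return null;
        } catch (VerifyError e) {
            return e;
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // bipush 42; ireturn
        System.out.println(check("Good", new byte[] { 0x10, 42, (byte) 0xAC }));
        // ireturn with nothing on the stack
        System.out.println(check("Broken", new byte[] { (byte) 0xAC }));
    }
}
```

Note that defineClass itself succeeds for the broken class; the error only shows up when the class is linked, which is one reason these failures can surface a long way from the code that generated the bytes.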

Free JVMs
There are a bunch of Free JVMs, and for a laugh I told apt-get(1) to install sablevm, kaffe, and jamvm (in that order, because that was the order in which they came to mind). jamvm(1) turned out to be the most useful, despite the fact (or because of the fact) that it doesn't have a verifier. This, combined with the fact that it seemed to be pretty robust, meant that I could often turn a verification error into a run-time error, and that can actually be useful. The other two weren't much use to me. IIRC, sablevm(1) did verify but with diagnostics no more useful than Sun's, and kaffe(1) mostly crashed on unverifiable code. I've kept jamvm(1) installed because it might come in handy again, but the other two are long gone.

BCEL's verifier
Assuming you get far enough to write a .class file, BCEL includes a fancy verifier called JustIce. I link to a page that proclaims itself obsolete because there's a link on that page to the author's Diplomarbeit, which is potentially useful to you. (It's in English.)

Anyway, apt-get(1) install libbcel-java and you can run:

bcel_cp=/usr/share/java/bcel-5.2.jar:/your/class/directory
java -cp "$bcel_cp" org.apache.bcel.verifier.Verifier YourGeneratedClass

JustIce is pretty verbose, which is mildly annoying when it's shouting "EVERYTHING IS OKAY!", but comes in handy when something's broken. Unlike Sun's verifier, you always get a bytecode index, you'll automatically get a disassembly, and you get plain English explanations using the usual JVM bytecode terminology. (Yes, you'll still have to have a rough idea of what you're doing, but I don't see how that's avoidable. If you're that lost, you should probably be getting someone else to write your bytecode-generating code for you.)
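
One practical detail: run this way, JustIce loads YourGeneratedClass from the class path, so a generator that only has a byte[] in memory needs to write it out as a .class file first. A minimal sketch, assuming the class directory and the internal (slash-separated) class name are known; ClassFileDump and its dump method are hypothetical names, not part of BCEL or ASM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ClassFileDump {
    // Writes bytecode to <classDir>/<internalName>.class, creating package
    // directories as needed. internalName uses slashes, e.g. "com/example/Foo".
    static Path dump(Path classDir, String internalName, byte[] bytecode) {
        Path file = classDir.resolve(internalName + ".class");
        try {
            Path parent = file.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.write(file, bytecode);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file;
    }
}
```

With the command above, classDir would be /your/class/directory and internalName would be YourGeneratedClass.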

ASM's verifier
ASM also comes with a verifier, and it's less loquacious, but it's under-documented and it took me ages to work out how to make it work on the byte[] from my ClassWriter, despite it being a mere two lines (the ASM 3.1 JavaDoc actually explains this, but earlier versions don't):

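import java.io.PrintWriter;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.util.CheckClassAdapter;
byte[] bytecode = ...
ClassReader cr = new ClassReader(bytecode);
CheckClassAdapter.verify(cr, false, new PrintWriter(System.err));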

The beauty of this verifier is that you get nice little stack pictures (that show the local slots too). The disassembly is nice and clear, and the explanations are fine too, and are somewhat less wordy than BCEL's.

(Note that, although I show BCEL's verifier being run as an external program and ASM's verifier being run as part of an application, that's only because I'm using ASM and not BCEL. You could equally well run BCEL's verifier as part of your application and ASM's verifier as an external application, and probably should if you're using BCEL.)

It's a shame that ASM's CheckClassAdapter does so little checking for the kind of errors that actually seem likely. I realize error-checking's not free, but ASM feels like it's sacrificed too much. Mostly you'll just see ArrayIndexOutOfBoundsException and NullPointerException thrown from its bowels. You'd be well advised to "apt-get source asm3" (I didn't work out how to use CheckClassAdapter before reading the source, for example).

Anyway, back to work. I've got bytecode to generate...