2007-07-04

Still no free lunch: the surprising cost of %

Some performance problems are easy to find. Most aren't. This week I came across one I'd never have found if I hadn't introduced the regression myself while trying to improve performance.

The code in question was part of Terminator, breaking up lines into runs of the same style for rendering. I was adding a condition that basically said "don't make any run too long, because it's expensive to render a long string, and that's hugely wasteful if it's just going to be clipped anyway". I keep meaning to look into why that is, but that's not the problem here.

My problem was that when I switched from a magic number to a computed guess for the "sensible maximum run length", my benchmark's run time went up by 30%. I wasn't expecting the better guess to improve performance, but I didn't think it would hurt, either.

I was actually working in Java, but that's irrelevant, so to protect Java from random drooling slashdot monkeys who might come across this post, I'll give an equivalent C++ example. Here's the program with a constant:

int f(int n) {
    int j = 0;
    for (int i = 0; i < 200 * 1024 * 1024; ++i) {
        if ((i % 82) == 0) {
            ++j;
        }
    }
    return j;
}

int main() {
    f(82);
    return 0;
}

The non-constant version differs only in that "i % 82" is replaced with "i % n". Here are the run times on an Athlon X2 3800+ running Linux, for code built by g++ 4.1.2 with -O2:

x86, i % n 4.5s
x86, i % 82 0.7s

For comparison, here are the run times on a PowerPC G5 running Mac OS, for code built by g++ 4.0.1 with -O2:

ppc, i % n 3.7s
ppc, i % 82 0.8s

So those clever RISC chaps aren't getting their free lunch either, and any fix I come up with to help the x86-using 99% is likely to benefit the RISC-using 1% too.

One advantage of running a C++ test is you can easily ask to see the assembler. On x86, in the slow case the problem is that we're doing a divide with "idivl %esi". (There is no "mod" instruction; you do a divide and get the remainder "for free".)

If you're anything like me, you long ago stopped counting the cost of integer operations. I don't think I've paid any real attention since I was last writing assembler, and that was a long time ago now. Even then, it was mostly out of curiosity. Back in those days, processors were simple, and instruction costs were given as cycles per instruction. These days with out-of-order execution, deep pipelines, and multiple heterogeneous back-end units, what you're told instead is the instruction's latency. The relevant entry from AMD's "Software Optimization Guide for AMD64 Processors" says that this instruction's latency is 26/42/74 cycles, depending on whether it's operating on a 16, 32, or 64-bit register. So that's 42 cycles for us.

Funnily enough, division is about as scary as it gets on x86, unless you start messing with the weird vector instructions. Most branches (bogey-men of old) are cheaper than division in latency terms, and I/O instructions are about the only "normal" instructions that have higher latencies. "Replace Division with Multiplication" is considered such good advice it has three separate sections in the aforementioned AMD manual.
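To make that advice concrete: when the divisor is a compile-time constant, the division can be turned into a multiplication by a precomputed reciprocal plus a shift, which is presumably the kind of thing the compiler does for us in the "i % 82" case. Here's a minimal sketch of the trick for 82 specifically; the constant M, the shift of 34, and the mod82 helper are my own illustration, not anything copied from the AMD manual or from the compiler's actual output:

#include <cassert>

// For d = 82, pick s = 34 and M = ceil(2^34 / 82). Then (i * M) >> 34 equals
// i / 82 for every i below 2^30, which comfortably covers the loop above.
static const unsigned long long M = ((1ULL << 34) + 81) / 82;

static unsigned mod82(unsigned i) {
    unsigned q = static_cast<unsigned>((i * M) >> 34);  // i / 82, done as a multiply and a shift.
    return i - q * 82;                                   // i % 82, without a divide.
}

int main() {
    // Spot-check the fast remainder against the obvious one over the same range.
    for (unsigned i = 0; i < 200 * 1024 * 1024; ++i) {
        assert(mod82(i) == i % 82);
    }
    return 0;
}

A multiply and a shift cost a handful of cycles rather than 42, which is the whole point.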

In my case, I switched to decrementing a counter (which probably ends up as something like "decl %eax", 1 cycle) and got my 30% back; there's a sketch of the reshaped loop below. It's hard to believe we're still doing this kind of thing in 2007, in languages several layers removed from the microcode. At least the continued usefulness of some understanding of the actual hardware you're running on means SQL-wielding managers aren't likely to make us all redundant in the near future!
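For the curious, here's roughly what that change looks like, recast in terms of the toy C++ program above rather than the actual Terminator code; the countdown variable and its reset logic are my own sketch of the idea, not a copy of what I committed:

int f(int n) {
    int j = 0;
    int countdown = 1;  // Fires on the first iteration, just as (0 % n) == 0 does.
    for (int i = 0; i < 200 * 1024 * 1024; ++i) {
        if (--countdown == 0) {
            ++j;
            countdown = n;  // Reset the counter; no division anywhere in the loop.
        }
    }
    return j;
}

A decrement and a compare per iteration instead of a divide, and it counts exactly the same multiples of n as the original.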