2007-04-24

Far-east Asian fonts with Java 7 on Ubuntu

If you read JDK6の新しい国際化機能についての記事, which my Mac reliably informs me means "The article concerning the internationalization function whose JDK6 is new", and you're using Java 7 on your Ubuntu box (as I am), you might have wondered why this guy's fonts are working and yours aren't.

Step 1: Actually install some non-English language support.
Choose "System" > "Administration" > "Language Support" from the menu, and check the languages you're interested in. This will net you all kinds of things (such as spelling checker dictionaries), but most importantly, this is how you make sure you have the right fonts actually installed. By default, if you chose "English" when you installed, you'll only have support for English. (Fonts that support English will typically cover other European languages too, but you won't get extras such as the spelling dictionaries for those languages unless you select them explicitly.)

Step 2: Tell Java to make use of your new fonts.
In the olden days, when everyone only read and wrote New Jersey Unix English and encoded it in ASCII, fonts would contain a glyph for every one of those glorious 95 printable characters, and maybe one extra for all the unprintable ones. 96 is a nice small number, but when Unicode came along, 65,536 seemed like a lot of glyphs. And a German font designer was unlikely to have designed new Hangul glyphs as part of his font, while a Korean font designer probably didn't worry too much about the glyphs needed for English.

The composite font (or "logical font") was invented to make up for this. Pike and Thompson's Hello World or Καλημέρα κόσμε or こんにちは 世界 was the first place I read about this idea. (TeX had used a more general form of the same idea, virtual fonts, for some time, but they were significantly more complex.)

The idea of a composite font is that it assigns different subfonts (or "physical fonts") to use for different Unicode ranges. So you get your Latin range from this physical font but your Greek range from this physical font and so on. Sometimes these physical fonts are designed to look good together, and sometimes they're not. But an ugly juxtaposition is better than not being able to see the character at all.
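To make the range-to-subfont idea concrete, here's a toy sketch of the lookup. It isn't how Java's CompositeFont is implemented; the class name and the "Example ..." font names are made up for illustration, and only the Unicode block boundaries are real.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a composite font: an ordered table of code-point ranges and physical font names. */
final class CompositeFontSketch {
    private static final class Range {
        final int first;
        final int last;
        final String physicalFont;

        Range(int first, int last, String physicalFont) {
            this.first = first;
            this.last = last;
            this.physicalFont = physicalFont;
        }
    }

    private final List<Range> ranges = new ArrayList<>();
    private final String fallback;

    CompositeFontSketch(String fallback) {
        this.fallback = fallback;
    }

    CompositeFontSketch add(int first, int last, String physicalFont) {
        ranges.add(new Range(first, last, physicalFont));
        return this;
    }

    /** Returns the first physical font whose range contains the code point, else the fallback. */
    String physicalFontFor(int codePoint) {
        for (Range r : ranges) {
            if (codePoint >= r.first && codePoint <= r.last) {
                return r.physicalFont;
            }
        }
        return fallback;
    }

    /** Hypothetical configuration; the font names are placeholders, not real fonts. */
    static CompositeFontSketch example() {
        return new CompositeFontSketch("Fallback Sans")
                .add(0x0000, 0x024F, "Example Latin")   // Basic Latin through Latin Extended-B
                .add(0x0370, 0x03FF, "Example Greek")   // Greek and Coptic
                .add(0x3040, 0x30FF, "Example Kana")    // Hiragana and Katakana
                .add(0xAC00, 0xD7A3, "Example Hangul"); // Hangul syllables
    }

    public static void main(String[] args) {
        CompositeFontSketch font = example();
        int[] samples = {'A', 0x03BA, 0x3053, 0xD55C, 0x4E16};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, font.physicalFontFor(cp));
        }
    }
}
```

The last sample (U+4E16, a CJK ideograph) isn't covered by any range in this table, so it lands on the fallback, which is the "ugly but visible" case.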

In Java, the composite fonts aren't hard-coded. They're defined in (these days) fontconfig.properties files in the JVM's jre/lib/ directory. You can find more details of the actual search in Sun's Font Configuration Files documentation, but you'd expect (especially given the recent announcements) that there would be an Ubuntu-specific file. There isn't, at least not as of Java 7-b11. I've raised Sun bug 6551584 for this.

So, you need to get yourself a "fontconfig.properties" appropriate to Ubuntu. You can write your own, as I did the day before reading Naoto Sato's post, or you can just copy the one that's packaged with the sun-java6-jdk package.

Now, as long as you only use one of Java's logical fonts ("Dialog" or "Monospaced" and so on) you'll see glyphs for the far-east Asian languages too. If you use a physical font such as "Lucida Sans Typewriter", you'll only see glyphs for the range that the physical font covers. This is in contrast to Mac OS, where everything just works without configuration, even if you ask for a physical font.

What if my application needs a physical font with fallbacks?
If you use new Font("Verdana"), say, you get just that physical font. You might think there would be a convenient way to say "I'd like a composite font, please, with sensible fallback fonts and this primary physical font", but you'd be wrong. Now, if your physical font is proportionally-spaced, there's good news. There's private, undocumented API in sun.font.FontManager that lets you create a new CompositeFont (which is-a java.awt.Font). The composite font is hard-coded to use the "Dialog" logical font fallbacks, and it's the mechanism used by Swing to fit in with the native platform's font.

It turns out that this API is actually exposed by javax.swing.text's StyleContext class, in the shape of the getFont method. If you call StyleContext.getDefaultStyleContext().getFont instead of new Font, you'll get a composite font. The only thing to watch out for is that StyleContext's so-called "cache" doesn't have an eviction policy. So if you're creating lots of randomized fonts, this might cause problems.
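As a rough sketch of the difference, the following asks for the same family both ways and then uses Font.canDisplayUpTo to see how far each font gets through a mixed-script string. The class and method names are mine, "Verdana" is just the example family from above, and the results depend entirely on which fonts and font configuration your machine has (if the family isn't installed, Java will quietly substitute something else).

```java
import java.awt.Font;
import javax.swing.text.StyleContext;

final class FontFallbackDemo {
    /** Asks StyleContext for the font instead of calling the Font constructor directly. */
    static Font withFallbacks(String family, int size) {
        return StyleContext.getDefaultStyleContext().getFont(family, Font.PLAIN, size);
    }

    public static void main(String[] args) {
        // "Hello " followed by two CJK ideographs, written as escapes to dodge source-encoding trouble.
        String sample = "Hello \u4e16\u754c";

        Font plain = new Font("Verdana", Font.PLAIN, 12);
        Font viaContext = withFallbacks("Verdana", 12);

        // canDisplayUpTo returns -1 if the font can display the whole string,
        // otherwise the index of the first character it can't display.
        System.out.println("new Font:     " + plain.canDisplayUpTo(sample));
        System.out.println("StyleContext: " + viaContext.canDisplayUpTo(sample));
    }
}
```

Remember that returning a composite font here is undocumented behavior, so treat the second result as something to check on your JVM rather than a guarantee.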

Obviously, returning a composite font isn't documented behavior of this method, but it would hurt Swing to regress, so it's unlikely to be broken. And you can always reflect FontManager.getCompositeFontUIResource if the worst comes to the worst.

What if my application needs a monospaced physical font with fallbacks?
If your physical font is monospaced, you're in trouble. The use of the "Dialog" logical font for fallbacks is hard-coded, so you can't just ask for "Monospaced" for your fallbacks instead. I've tried using reflection on both CompositeFont.replaceComponentFont and FontManager.replaceFont, but without success.

The only reliable work-around I can think of would be to rewrite your text rendering to use multiple fonts, choosing the right one for each run of characters in any given Unicode range. (That is, duplicate the work of CompositeFont in your application's code.) I've raised Sun bug 6551615 for this.
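Here's a minimal sketch of the run-splitting half of that work-around. To keep it independent of which fonts are installed, each "font" is modeled as an IntPredicate saying whether it covers a code point; the class, the Run type, and the two stand-in fonts in main are all hypothetical. Actual drawing (measuring and painting each run with its font) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

final class FontRuns {
    /** A run of text plus the index of the font chosen for it (-1 if no font covers it). */
    static final class Run {
        final String text;
        final int fontIndex;

        Run(String text, int fontIndex) {
            this.text = text;
            this.fontIndex = fontIndex;
        }
    }

    /** Returns the index of the first font that covers the code point, or -1 if none does. */
    static int pick(List<IntPredicate> fonts, int codePoint) {
        for (int i = 0; i < fonts.size(); i++) {
            if (fonts.get(i).test(codePoint)) {
                return i;
            }
        }
        return -1;
    }

    /** Splits text into maximal runs whose code points all map to the same font. */
    static List<Run> split(String text, List<IntPredicate> fonts) {
        List<Run> runs = new ArrayList<>();
        int runStart = 0;
        int runFont = -2; // sentinel meaning "no run started yet"
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int font = pick(fonts, cp);
            if (font != runFont) {
                if (i > runStart) {
                    runs.add(new Run(text.substring(runStart, i), runFont));
                }
                runStart = i;
                runFont = font;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        if (text.length() > runStart) {
            runs.add(new Run(text.substring(runStart), runFont));
        }
        return runs;
    }

    public static void main(String[] args) {
        IntPredicate asciiOnly = cp -> cp < 0x80;                // stand-in for a monospaced Latin font
        IntPredicate kanaOnly = cp -> cp >= 0x3040 && cp <= 0x30FF; // stand-in for a Japanese fallback
        for (Run run : split("ab\u3053\u3093c", Arrays.asList(asciiOnly, kanaOnly))) {
            System.out.println(run.fontIndex + ": " + run.text.length() + " chars");
        }
    }
}
```

With real fonts, Font.canDisplay(int) could serve as the predicate, with your monospaced physical font first in the list and the fallbacks after it.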

2007-04-03

POSIX ACLs: useless and expensive

I wish I'd not mentioned POSIX ACLs a few posts back, because now I feel compelled to talk about them despite their irrelevance. So I'll keep this short.

I was doing some testing recently that required quite a bit of output, and since Linux does such a good job of caching file system data (both metadata and userdata), I was using "ls -lR" on a tree of about 100,000 files spread over 10,000 directories. Curious as to what it would do, I ran it under strace(1) and found that not only does it call lstat64 for each file, it calls getxattr to request "system.posix_acl_access", and gets back EOPNOTSUPP each time since the file system I'm using doesn't have POSIX ACLs.

The weird part is that if I "apt-get source coreutils" and configure and make, the autoconf cruft seems to decide not to include POSIX ACL support, and I get an ls(1) that runs twice as fast (a consistent 0.8s/0.5s/0.3s real/user/sys time versus 1.8s/0.9s/0.9s). So did someone go out of their way to give Ubuntu 6.10 users a slower ls(1)? (If so, I doubt it was specifically anyone involved in Ubuntu; old-style Debian shows the same behavior.) Why isn't ls(1) using pathconf(3) and _PC_ACL_EXTENDED? Because seemingly, that isn't part of POSIX fpathconf, maybe because the POSIX ACL standard was abandoned before it was finished.

(I also notice that coreutils' ls(1) does lots of small writes to stdout, which is interesting because I didn't notice anything in the source that suggested they were going out of their way to do that. stdio might have been doing that to them, which is a sad thought.)

How come NFS has GETATTR, READDIR, and READDIRPLUS, but Linux only has lstat64 and getdents64? Where's "getdentsplus64"? And where is it in our C library? Or in our JDK? Is the interface to the kernel/JNI really so unlike the network that it's not worth offering bulk operations? Why is it that NFS (and CIFS) think READDIRPLUS-style directory access is important, and many of our applications really want READDIRPLUS-style directory access, yet there's no way to transmit that fact through the layers?

The paper Efficient and Safe Execution of User-Level Code in the Kernel by Erez Zadok et al. mentions this very example:

We found several promising system call patterns, including open-read-close, open-write-close, open-fstat, and readdir-stat. We implemented several new system calls to measure the improvements. The main savings for the first three combinations would be the reduced number of context switches. The readdirplus system call returns the names and status information for all of the files in a directory. This combines readdir with multiple stat calls. Here we save on both context switches and data copies, because once we get the file names we can directly use them to get the stat information. This is a well-known optimization, and was introduced in NFSv3 [2].

Evaluation
We tested readdirplus on a 1.7GHz Intel Pentium 4 machine with 884MB of RAM running Linux 2.6.10. We used an IDE disk formatted with an Ext3 file system. We benchmarked readdirplus against a program which did a readdir followed by stat calls for each file. We increased the number of files by powers of 10 from 10 to 100,000 and found that the improvements were fairly consistent: elapsed, system, and user times improved 60.6-63.8%, 55.7-59.3%, and 82.8-84.0%, respectively.
To see how this might affect an average user's workload, we logged the system calls on a system under average interactive user load for approximately 15 minutes. We then calculated the expected savings if readdirplus were used. The total amount of data transfered between user and kernel space was 51,807,520 bytes, and we estimate that if readdirplus were used we would only transfer 32,250,041 bytes. We would also do far fewer system calls: 17,251 instead of 171,975. This would translate to a savings of about 28.15 seconds per hour. Although this savings is small, it is for an interactive workload. We expect that other CPU-bound workloads, such as mail and Web servers, would benefit more significantly from new system calls.
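For reference, this is roughly what the readdir-then-stat pattern looks like from the Java side. It's a sketch using the java.nio.file API, which is newer than this post; the class and method names are mine. The point is only the shape: one pass to get the names, then a separate attribute request per entry, which is the part a readdirplus-style call would fold into the listing.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Map;
import java.util.TreeMap;

final class ReaddirThenStat {
    /** Lists a directory, then fetches each entry's attributes separately (not following symlinks). */
    static Map<String, BasicFileAttributes> list(Path dir) throws IOException {
        Map<String, BasicFileAttributes> result = new TreeMap<>();
        // The "readdir" half: enumerate the names.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                // The "stat" half: one more attribute request for every single entry.
                BasicFileAttributes attrs =
                        Files.readAttributes(entry, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
                result.put(entry.getFileName().toString(), attrs);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, BasicFileAttributes> e : list(dir).entrySet()) {
            System.out.println(e.getValue().size() + "\t" + e.getKey());
        }
    }
}
```

Running something like this under strace(1) on a big directory is an easy way to see how the per-entry calls pile up on your own system.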

If I had my own Linux monkey, he'd be working on this right now. Not because I wouldn't rather have DTrace for Linux or ZFS for Linux, but because this sounds easy.

I suppose I should be thankful that at least where file system access is most expensive, NFS's READDIRPLUS is stopping the other layers from screwing me over too badly. Talking to the kernel isn't free, but at least my kernel's not a WAN away.