I was doing some testing recently that required quite a bit of output, and since Linux does such a good job of caching file system data (both metadata and userdata), I was using "ls -lR" on a tree of about 100,000 files spread over 10,000 directories. Curious as to what it would do, I ran it under strace(1) and found that not only does it call lstat64 for each file, it calls getxattr to request "system.posix_acl_access", and gets back EOPNOTSUPP each time since the file system I'm using doesn't have POSIX ACLs.
The weird part is that if I "apt-get source coreutils" and configure and make, the autoconf cruft decides not to include POSIX ACL support, and I get an ls(1) that runs twice as fast (a consistent 0.8s/0.5s/0.3s real/user/sys time versus 1.8s/0.9s/0.9s). So did someone go out of their way to give Ubuntu 6.10 users a slower ls(1)? (If so, I doubt it was anyone specifically involved in Ubuntu; plain old Debian shows the same behavior.) Why isn't ls(1) using pathconf(3) and _PC_ACL_EXTENDED to ask once per file system instead of probing every file? Seemingly because _PC_ACL_EXTENDED isn't part of POSIX's fpathconf, maybe because the POSIX ACL draft standard was abandoned before it was finished.
(I also notice that coreutils' ls(1) does lots of small writes to stdout, which is interesting because I didn't notice anything in the source that suggested they were going out of their way to do that. stdio might have been doing that to them, which is a sad thought.)
How come, though NFS has GETATTR and READDIR and READDIRPLUS, Linux only has lstat64 and getdents64? Where's "getdentsplus64"? And where's its equivalent in our C library, or in our JDK? Is the interface to the kernel/JNI really so unlike the network that it's not worth offering bulk operations? NFS (and CIFS) thinks READDIRPLUS-style directory access is important, and many of our applications really want READDIRPLUS-style directory access, but there's no way to transmit that fact through the layers.
The paper Efficient and Safe Execution of User-Level Code in the Kernel by Erez Zadok et al mentions this very example:
We found several promising system call patterns, including open-read-close, open-write-close, open-fstat, and readdir-stat. We implemented several new system calls to measure the improvements. The main savings for the first three combinations would be the reduced number of context switches. The readdirplus system call returns the names and status information for all of the files in a directory. This combines readdir with multiple stat calls. Here we save on both context switches and data copies, because once we get the file names we can directly use them to get the stat information. This is a well-known optimization, and was introduced in NFSv3.
We tested readdirplus on a 1.7GHz Intel Pentium 4 machine with 884MB of RAM running Linux 2.6.10. We used an IDE disk formatted with an Ext3 file system. We benchmarked readdirplus against a program which did a readdir followed by stat calls for each file. We increased the number of files by powers of 10 from 10 to 100,000 and found that the improvements were fairly consistent: elapsed, system, and user times improved 60.6-63.8%, 55.7-59.3%, and 82.8-84.0%, respectively.
To see how this might affect an average user's workload, we logged the system calls on a system under average interactive user load for approximately 15 minutes. We then calculated the expected savings if readdirplus were used. The total amount of data transferred between user and kernel space was 51,807,520 bytes, and we estimate that if readdirplus were used we would only transfer 32,250,041 bytes. We would also do far fewer system calls: 17,251 instead of 171,975. This would translate to a savings of about 28.15 seconds per hour. Although this savings is small, it is for an interactive workload. We expect that other CPU-bound workloads, such as mail and Web servers, would benefit more significantly from new system calls.
If I had my own Linux monkey, he'd be working on this right now. Not because I wouldn't rather have DTrace for Linux or ZFS for Linux, but because this sounds easy.
I suppose I should be thankful that at least where file system access is most expensive, NFS's READDIRPLUS is stopping the other layers from screwing me over too badly. Talking to the kernel isn't free, but at least my kernel's not a WAN away.