2007-01-29

What do the anchors ^ and $ mean in a regular expression?

One of the least great things about regular expressions is that there's no One True Standard. Sure, there's POSIX, but it's both broken and limited, so no-one really happily uses that. There's Perl, too, and in fact most modern regular expression implementations imitate Perl reasonably well. There's often trouble with things like Perl's embedded code syntax, but that's fair enough: you can't realistically expect every regular expression library to include a Perl interpreter, and most people wouldn't want such a thing even if it were available.

Java's documentation for java.util.regex.Pattern even contains a "Comparison to Perl 5" section, but it's sadly incomplete. It doesn't mention minor bugs such as the treatment of # in a character class (despite it being a long-standing bug and there being no obvious intention of actually fixing it).

Usually, though, it's fairly esoteric stuff that differs between implementations. The basics are the same everywhere. Or so I'd have said yesterday. In particular, I thought "what do the anchors ^ and $ mean in a regular expression?" was an easy question; one that even a regular expression beginner could answer completely.

So. What do the anchors ^ and $ mean?

Let's start with Perl. You'll have to forgive my dialect; I left long ago and have rarely ventured back since (and, as it turns out, my first attempt was wrong and had to be corrected by Chris Reece). If we give Perl a string with a bunch of newlines and ask it how many times $ matches, we get the perfectly sensible answer "once":

$ perl -e '@m = "a\nb\nc\nhello\nworld\n" =~ m/$/g; print scalar(@m)."\n";'
1

By default, $ matches end-of-input, and there's only one end of input.

If we give Perl the same string and ask it how many times $ matches in multiline mode, we get this answer:

$ perl -e '@m = "a\nb\nc\nhello\nworld\n" =~ m/$/mg; print scalar(@m)."\n";'
6

To me, this doesn't match the documentation. perlre(1) says:

    m   Treat string as multiple lines. That is, change "^" and "$" from
        matching the start or end of the string to matching the start or
        end of any line anywhere within the string.

Seemingly, that should be "...to also matching the end of any line anywhere within the string".

Moving on to Ruby, we get six matches again, this time without even asking for multiline mode. There isn't, as far as I know, any canonical documentation for Ruby's regular expression syntax, but the on-line "pickaxe" book says that "$ Matches the end of a line". So $ is always a line anchor in Ruby and, for some reason, Ruby also thinks our string has a newline we can't see.

$ ruby -e 'puts("a\nb\nc\nhello\nworld\n".scan(/$/).length())'
6

We can start up irb(1) and see just how convinced it is:

irb(main):001:0> "a\nb\nc\nhello\nworld\n".scan(/$/)
=> ["", "", "", "", "", ""]

Even more curious is that multiline mode seems to make no difference:

irb(main):002:0> "a\nb\nc\nhello\nworld\n".scan(/$/m)
=> ["", "", "", "", "", ""]

Turning to the documentation, we find out why. The definition of multiline mode is given as "Normally, '.' matches any character except a newline. With the /m option, '.' matches any character". So it turns out that in the land of the rising sun, multiline mode means something completely different. Java and Python call this dotall mode, while Perl calls it single-line mode. Everyone but Ruby uses (?s) to control this (after Perl's "single-line" name).
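To make the naming muddle concrete, here is a minimal Java sketch of the mode that Java calls dotall and Ruby calls multiline. The class name DotallDemo, the helper matches, and the sample pattern a.b are made up for illustration; only Pattern.DOTALL and the embedded (?s) flag come from java.util.regex itself.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern and input are illustrative only.
class DotallDemo {
    // True if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // By default, '.' refuses to match a line terminator.
        System.out.println(matches("a.b", 0, "a\nb"));              // false
        // Pattern.DOTALL lets '.' match any character, newline included.
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb")); // true
        // The embedded flag (?s) switches on the same mode inline.
        System.out.println(matches("(?s)a.b", 0, "a\nb"));          // true
    }
}
```

Note that none of this touches ^ or $; in Java, those are governed by Pattern.MULTILINE, which is a separate flag.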

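(If you want start-of-input or end-of-input anchors in Ruby, you need to use \A or \z, which are also available in Java and Perl. Python has \A too, but spells its strict end-of-input anchor \Z. Outside Python, you don't want \Z, because \Z behaves in the same way as $ does by default in Java and Perl, also matching just before a final line terminator; in Java, it's even implemented using the same underlying primitive.)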
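Counting matches for each anchor in the sample string is a quick way to see which ones are strict input anchors in Java. This is only a sketch: AnchorCount, INPUT, and the count helper are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for comparing anchors; not part of any library.
class AnchorCount {
    static final String INPUT = "a\nb\nc\nhello\nworld\n";

    // Number of times the regex is found in the input, default flags.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("\\A", INPUT)); // 1: start of input only
        System.out.println(count("\\z", INPUT)); // 1: very end of input only
        System.out.println(count("\\Z", INPUT)); // 2: before final "\n", and at the end
    }
}
```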

And so to Java:

import java.util.regex.*;

public class test {
    public static void main(String[] args) {
        // Count how many times $ matches in the sample string.
        Pattern p = Pattern.compile("$");
        Matcher m = p.matcher("a\nb\nc\nhello\nworld\n");
        int count = 0;
        while (m.find()) {
            ++count;
        }
        System.err.println(count);
    }
}

How often does Java claim that $ matches in our now-familiar string?

2

And in multiline mode?

6

Taking the latter result first, although idiotic, this appears to be the documented behavior: "In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence". Note well that "or" (emphasis mine). This seems to be Python's interpretation of $ in multiline mode also, so that's a full house. I don't know why anyone would want this behavior, but my guess is that they felt that there's an implicit line terminator at end of file (but even if there's an explicit line terminator?).
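The documented rule is easy to check by listing where the multiline matches fall rather than just counting them. The following is a small sketch under that assumption; the class name MultilineDollar and the offsets helper are invented for the example, while Pattern.MULTILINE is the real flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: where does $ match in multiline mode?
class MultilineDollar {
    // Start offsets of every match of $ under Pattern.MULTILINE.
    static List<Integer> offsets(String input) {
        Matcher m = Pattern.compile("$", Pattern.MULTILINE).matcher(input);
        List<Integer> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        // Five matches sit just before a "\n"; the sixth is at end of input.
        System.out.println(offsets("a\nb\nc\nhello\nworld\n"));
        // prints [1, 3, 5, 11, 17, 18]
    }
}
```

Offsets 1, 3, 5, 11, and 17 are the "just before a line terminator" matches, and 18 is the "end of the input sequence" match that follows the final terminator.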

As for the former result, presumably that's a Java bug? Here's a slight variant that shows where the matches occur:

import java.util.regex.*;

public class test {
    public static void main(String[] args) {
        // Print the start..end offsets of each match of $.
        Pattern p = Pattern.compile("$");
        Matcher m = p.matcher("a\nb\nc\nhello\nworld\n");
        while (m.find()) {
            System.err.println(m.start() + ".." + m.end());
        }
    }
}

And here's the output:

17..17
18..18

So we're getting a match just before the final line terminator (at offset 17) and then another match at end-of-input (offset 18). If the input doesn't end in an explicit line terminator, Java correctly reports just one match.

I can't find anything in the Java documentation to suggest the two-match behavior isn't a bug, and have filed a bug report already. In the meantime, you might like to try \z if you really just want to match at the end of input.
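For anyone who wants to see the effect of the trailing line terminator and the \z workaround side by side, here is a rough sketch. The class name TrailingTerminator and the spans helper are invented for the example; the behavior shown is what the default (non-multiline) java.util.regex settings produce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing $ and \z with default flags.
class TrailingTerminator {
    // Each match rendered as "start..end", in the order found.
    static List<String> spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.start() + ".." + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        String withNewline = "a\nb\nc\nhello\nworld\n";
        String withoutNewline = "a\nb\nc\nhello\nworld";

        System.out.println(spans("$", withNewline));    // [17..17, 18..18]
        System.out.println(spans("$", withoutNewline)); // [17..17]
        System.out.println(spans("\\z", withNewline));  // [18..18]
    }
}
```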