Monday, June 14, 2010

Two things in Java regex

1. Java regex "eats" characters: the matcher scans from left to right, and once a character has been consumed by a match, it cannot be part of a later match. Each find() resumes where the previous match ended, so overlapping occurrences are skipped. The following code demonstrates this:

Pattern pattern = Pattern.compile("aba");
Matcher matcher = pattern.matcher("abababababa");
while (matcher.find()) {
    System.out.print(matcher.start() + " ");
}

output: 0 4 8
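
To make the "eaten" characters visible, here is a minimal self-contained sketch that records where each match starts and ends (end is exclusive). The class name ConsumedCharsDemo and the helper spans are made up for this illustration; only Pattern, Matcher, find(), start() and end() come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumedCharsDemo {
    // Hypothetical helper: returns each match as "start-end" (end exclusive).
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        // Each find() resumes scanning at the end of the previous match,
        // so characters already consumed are never matched again.
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [0-3, 4-7, 8-11]
        System.out.println(spans("aba", "abababababa"));
    }
}
```

The occurrences of "aba" starting at indexes 2 and 6 are never reported, because their first "a" was already used by the preceding match.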

2. Greedy and reluctant quantifiers: a greedy quantifier matches as many characters as it can while still letting the overall pattern succeed. For example, against "abababababa" the pattern "[\\w]*aba" matches the whole string "abababababa". The engine first lets [\\w]* consume the entire input, then backs off one character at a time until the trailing "aba" can match, which is why the match ends at the rightmost "aba".

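A reluctant quantifier, on the other hand, matches as few characters as it can, so each match ends at the earliest possible "aba" and the same string yields more, shorter matches. For the same source string, the pattern "[\\w]*?aba" finds "aba", "baba" and "baba". Here is the demonstration: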

Pattern pattern2 = Pattern.compile("[\\w]*aba");
Matcher matcher2 = pattern2.matcher("abababababa");
while (matcher2.find()) {
    System.out.println(matcher2.group());
}

output: abababababa

Pattern pattern3 = Pattern.compile("[\\w]*?aba");
Matcher matcher3 = pattern3.matcher("abababababa");
while (matcher3.find()) {
    System.out.println(matcher3.group());
}

output (one match per line): aba baba baba
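
The two snippets above can be combined into one runnable sketch that also records where each match starts, which shows why the second and third reluctant matches begin with "b". The class name QuantifierDemo and the helper matches are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: returns each match as "text@startIndex".
    static List<String> matches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group() + "@" + m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "abababababa";
        // Greedy: [\w]* takes everything, then backs off until "aba" fits.
        // prints [abababababa@0]
        System.out.println(matches("[\\w]*aba", source));
        // Reluctant: [\w]*? takes as little as possible before "aba".
        // prints [aba@0, baba@3, baba@7]
        System.out.println(matches("[\\w]*?aba", source));
    }
}
```

The reluctant matches start at indexes 0, 3 and 7: after "aba" is consumed, the next search begins at the "b" at index 3, so [\\w]*? has to take that one "b" before "aba" can match again.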

I will post a separate page about possessive quantifiers later.