
Getting started with jregex

Pattern matching

Pattern searching

Replacing

Tokenizing

Filesystem utilities

Appendix A: Compilation flags

Appendix B: Using groups

Appendix C: Unicode

Appendix D: Backslash issues

Pattern matching

Pattern matching here means testing whether an entire string matches a pattern.

1. Create a Pattern instance:

   Pattern p=new Pattern("\\w+"); //a word pattern

2. Obtain a Matcher for a string:

   Matcher m=p.matcher(myText);

3. Test:

   if(m.matches()){
      System.out.println("The string is exactly a single word!");
   }
   else{
      System.out.println("The string is not a single word!");
   }

Upon a successful match you can retrieve groups (if the pattern contains any) as described in Appendix B: Using groups.

Pattern searching

Pattern searching here means finding non-overlapping occurrences of a pattern in a string. This is done with the find() method of the Matcher class. When called for the first time, find() starts searching at position zero of the target and returns true if an occurrence is found. Each following call resumes the search just after the end of the previous match, which makes it possible to find all non-overlapping occurrences of the pattern.

1. Create a Pattern instance:

   Pattern p=new Pattern("\\w+"); //a word pattern

2. Obtain a Matcher for the string:

   Matcher m=p.matcher(myText);

3. Search:

   while(m.find()){
      System.out.println("next word: ["+m.toString()+"]");
   }

3.1. The same as 3 using MatchIterator paradigm:

   MatchIterator mi=m.findAll();
   while(mi.hasMore()){
      MatchResult mr=mi.nextMatch();
      System.out.println("a word found: "+mr.toString());
   }

Upon a successful search you can retrieve groups (if the pattern contains any) as described in Appendix B: Using groups.
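The difference between matching the whole string and searching inside it can be tried out without jregex on the classpath. The sketch below is a stand-in that uses the JDK's java.util.regex package, whose API differs from jregex (for example, Pattern.compile(...) instead of new Pattern(...)); the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch using the JDK's java.util.regex, not jregex.
class MatchVersusFindDemo {
    // True only if the entire input is a single word.
    static boolean isSingleWord(String input) {
        return Pattern.compile("\\w+").matcher(input).matches();
    }

    // Collects every non-overlapping word, scanning left to right.
    static List<String> words(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(isSingleWord("hello"));       // true
        System.out.println(isSingleWord("hello world")); // false
        System.out.println(words("hello world"));        // [hello, world]
    }
}
```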

Replacing

The Replacer class lets you replace all occurrences of a Pattern within a string with one of the following:

a plain string

a string with group references

dynamically generated content (using the Substitution interface)

Group references can be either numeric ("$N" or "${N}", where N is a group number) or symbolic ("${W}", where W is a group name).

The simplest usage looks like this:

Pattern p=new Pattern("(\\d\\d):(\\d\\d):(\\d\\d)");
Replacer r=p.replacer("[hour=$1, minute=$2, second=$3]");
//see also the constructor Replacer(Pattern,String,boolean)
String result=r.replace("the time is 10:30:01");
//gives "the time is [hour=10, minute=30, second=01]"

You can also append the result to a StringBuffer:

StringBuffer sb=...;
Replacer r=...;
r.replace("the input string",sb);
//now sb contains the result of replacement

or to a Writer:

Writer out=...;
Replacer r=...;
r.replace("the input string",out);

or to any TextBuffer instance:

Writer out=...;
TextBuffer tb=Replacer.wrap(out);
//see Replacer.wrap(Writer) and Replacer.wrap(StringBuffer)
Replacer r=...;
r.replace("the input string",tb);

Note that the Replacer class has many similar methods for various input types.

If a task requires flexibility that Perl-like substitution expressions cannot provide, you can supply a custom implementation of the Substitution interface. For example:

Pattern p=new Pattern("(\\d+)\\+(\\d+)"); 
Substitution add=new Substitution(){
   public void appendSubstitution(MatchResult match,TextBuffer dest){
      int a=Integer.parseInt(match.group(1));
      int b=Integer.parseInt(match.group(2));
      dest.append(String.valueOf(a+b));
   }
};
Replacer r=p.replacer(add);
String result=r.replace("1+2 3+4");
//"3 7"
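For comparison, both kinds of replacement (group references and computed content) can be reproduced with the JDK's java.util.regex. This is only a stand-in sketch, not the jregex Replacer API: it assumes Java 9 or later for Matcher.replaceAll(Function), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Stand-in sketch using the JDK's java.util.regex, not jregex's Replacer.
class ReplaceDemo {
    // Replacement with numeric group references ($1, $2, $3).
    static String labelTime(String input) {
        Pattern p = Pattern.compile("(\\d\\d):(\\d\\d):(\\d\\d)");
        return p.matcher(input).replaceAll("[hour=$1, minute=$2, second=$3]");
    }

    // Replacement computed per match: each "a+b" becomes the sum of a and b.
    static String addUp(String input) {
        Pattern p = Pattern.compile("(\\d+)\\+(\\d+)");
        return p.matcher(input).replaceAll(match -> {
            int a = Integer.parseInt(match.group(1));
            int b = Integer.parseInt(match.group(2));
            return String.valueOf(a + b);
        });
    }

    public static void main(String[] args) {
        System.out.println(labelTime("the time is 10:30:01"));
        // the time is [hour=10, minute=30, second=01]
        System.out.println(addUp("1+2 3+4"));
        // 3 7
    }
}
```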

String tokenizing

Tokenizing a string with jregex's RETokenizer class is very similar to using the standard StringTokenizer class. The only difference is that RETokenizer uses occurrences of a pattern as token delimiters:

   String theText=" Some --- strings --- separated by \"---\"";
   Pattern p=new Pattern("(?<!\")---(?!\")"); //three hyphens not enclosed in quotemarks
   RETokenizer tok=new RETokenizer(p,theText);
   while(tok.hasMore())System.out.println("Next token: "+tok.nextToken());
   //prints the following tokens, one per line, each prefixed with "Next token: ":
   // Some 
   // strings 
   // separated by "---"

RETokenizer has a split() method that returns all the tokens as a String array:

   Pattern p=...;
   String[] arr=p.tokenizer("input string").split();

An important point is how RETokenizer handles several adjacent delimiters: it can treat them either as a single delimiter or as separate delimiters with empty tokens between them. This behaviour is controlled with the RETokenizer.setEmptyEnabled(boolean) method.
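The two interpretations of adjacent delimiters can be illustrated with the JDK's String.split. This is an analogy only: it does not use RETokenizer, it does not show which setting of setEmptyEnabled corresponds to which result, and the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Analogy using String.split from the JDK, not jregex's RETokenizer.
class EmptyTokenDemo {
    // Every "-" is a separate delimiter, so "a--b" has an empty token in the middle.
    static List<String> keepEmpty(String s) {
        return Arrays.asList(s.split("-"));
    }

    // A run of "-" counts as one delimiter, so no empty token appears.
    static List<String> mergeAdjacent(String s) {
        return Arrays.asList(s.split("-+"));
    }

    public static void main(String[] args) {
        System.out.println(keepEmpty("a--b"));     // [a, , b]
        System.out.println(mergeAdjacent("a--b")); // [a, b]
    }
}
```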

Filesystem utilities

1. The jregex.util.io.WildcardFilter class.

This implementation of the java.io.FilenameFilter interface lets you filter files by name using the well-known wildcards "?" and "*", where the first matches any single character and the second matches any string. So "?or?" matches "wORd" and "mORe", and "*or*" matches both of those plus "transpORtation", "mirrOR", "ORbit", etc. Usage:

   File dir=...;
   String[] htmlFiles=dir.list(new WildcardFilter("*.html"));

2. The jregex.util.io.PathPattern class.

This class has two possible applications:

to search files by path patterns

to match the system-dependent path strings against the system-independent patterns

2.1. File search (the key method: PathPattern.enumerateFiles())

The path pattern can be either relative or absolute and may contain the following wildcards:

? - any-character

* - any-string

** - any-path

Some examples:

/** - all files and directories under the root;

/**/ - all directories under the root;

/**/tmp/**/*.java - all .java files under the root that include a 'tmp' directory somewhere in the path;

** - all files under the current directory, the same as **/*

**/*.java - all .java files under the current directory

*/* - all files and directories that are one level below the current directory

*.j??? - all files in the current directory whose extension consists of 4 chars starting with 'j'

Usage:

   PathPattern pp=new PathPattern("/tmp/**/*.java");
   Enumeration e=pp.enumerateFiles();
   while(e.hasMoreElements()){
      File f=(File)e.nextElement();
      f.delete();
   }

2.2. Path string matching (the key methods: Pattern.matches(String), Pattern.matcher(), etc)

As a descendant of jregex.Pattern, PathPattern inherits all of its functionality, so it can be used to search and match path strings.
For example, the pattern */*.java matches the strings foo/Bar.java and bar\Foo.java (on Windows), and does not match FooBar, FooBar.java, or foo/bar/FooBar.java.
Note that each wildcard forms a capturing group in the pattern.
Usage:

   String myPath=...;
   Pattern p=new PathPattern("**/*"); //the "**" is the 1-st group, the "*" is the second
   Matcher m=p.matcher(myPath);
   if(m.matches()){
      System.out.println("directory: "+m.group(1));
      System.out.println("file name: "+m.group(2));
   }
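To make the idea of "each wildcard is a capturing group" more concrete, here is a rough sketch of how path wildcards could be translated into an ordinary regular expression. It is not jregex's actual implementation: it uses the JDK's java.util.regex, accepts both "/" and "\" as separators on every platform, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative translation of path wildcards into a java.util.regex pattern.
// Not the jregex PathPattern implementation.
class WildcardSketch {
    static String toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*' && i + 1 < wildcard.length() && wildcard.charAt(i + 1) == '*') {
                sb.append("(.*)");          // "**": any path, may cross separators
                i++;
            } else if (c == '*') {
                sb.append("([^/\\\\]*)");   // "*": any string without a separator
            } else if (c == '?') {
                sb.append("([^/\\\\])");    // "?": any single non-separator character
            } else if (c == '/') {
                sb.append("[/\\\\]");       // accept either separator style
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return sb.toString();
    }

    static boolean matches(String wildcard, String path) {
        return Pattern.matches(toRegex(wildcard), path);
    }

    // Returns the text captured by each wildcard, or null if the path does not match.
    static String[] groups(String wildcard, String path) {
        Matcher m = Pattern.compile(toRegex(wildcard)).matcher(path);
        if (!m.matches()) return null;
        String[] result = new String[m.groupCount()];
        for (int i = 0; i < result.length; i++) result[i] = m.group(i + 1);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(matches("*/*.java", "foo/Bar.java"));        // true
        System.out.println(matches("*/*.java", "foo/bar/FooBar.java")); // false
        String[] g = groups("**/*", "src/main/Foo.java");
        System.out.println(g[0] + " | " + g[1]);                        // src/main | Foo.java
    }
}
```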

Appendix A: Compilation flags

Compilation flags change the meaning of some syntax elements. They can be passed to the Pattern constructor either as a string containing the appropriate characters or as a bitwise OR of int constants. The flags are:

int form	string form	if enabled	Default state
REFlags.IGNORE_CASE	"i"	Forces a matcher to ignore case	Disabled
REFlags.MULTILINE	"m"	Forces a '^' tag to match BOLs and a '$' to match EOLs	Disabled
REFlags.DOTALL	"s"	Forces a '.' (dot) tag to match line separator chars	Disabled
REFlags.IGNORE_SPACES	"x"	Forces the compiler to ignore spaces in the expression, so a pattern can be spaced out for better readability	Disabled
REFlags.UNICODE	"u"	Forces a compiler to treat \w, \d, \s, etc. as relating to Unicode	Disabled
REFlags.XML_SCHEMA	"X"	Enables compatibility with XML schema regular expressions	Disabled

A flag string looks like "imsxuX-imsxuX", where characters before the hyphen enable the corresponding flags and characters after the hyphen disable them. Such a string can also be embedded into a pattern using the "(?imsxuX-imsxuX)" and "(?imsxuX-imsxuX:)" constructs. The first sets flags for the rest of the pattern, while the second sets flags only for the enclosed part (between the colon and the closing parenthesis).
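The scoping rules of the two embedded-flag constructs can be seen in a small experiment. The sketch below is a stand-in that uses the JDK's java.util.regex, which supports the same "(?i)", "(?-i)" and "(?i:...)" forms for the "i" flag; it does not exercise jregex or its REFlags constants, and the class name is made up.

```java
import java.util.regex.Pattern;

// Stand-in sketch using the JDK's java.util.regex, not jregex.
class EmbeddedFlagsDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "(?i)" applies to the rest of the pattern.
        System.out.println(matches("(?i)abc", "ABC"));      // true
        // "(?i:...)" applies only to the enclosed part.
        System.out.println(matches("a(?i:b)c", "aBc"));     // true
        System.out.println(matches("a(?i:b)c", "ABc"));     // false
        // "(?-i)" switches the flag off again for what follows.
        System.out.println(matches("(?i)a(?-i)b", "Ab"));   // true
        System.out.println(matches("(?i)a(?-i)b", "AB"));   // false
    }
}
```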

Appendix B: Using groups

A group is a construct in regular expressions that looks like a part of a pattern enclosed in parentheses. Two conditions must also be met: the parentheses must not be escaped, and the opening one must not be followed by an unescaped question mark (for example, "(abc)" is a valid group). Groups can be nested.

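Groups have two functions.

The first function is grouping itself: the enclosed part of the pattern is treated as a single unit, so that, for example, a quantifier placed after the closing parenthesis applies to the whole group.

The second function is that upon a successful match the group "captures" the corresponding part of the input, which can then be accessed independently.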

This is one of the most attractive features of regular expressions: not only can we describe a complex structure and find the matching substring in a text, but we can also immediately retrieve the values corresponding to the parts of this structure.

For example, the pattern "(\d+)(\s+)(\d+)" has three groups: the first captures some digits, the second captures some spaces, and the third captures some digits again.

To retrieve the contents of a group, we need a way to address it. For this purpose the groups in an expression are numbered automatically. Numbering starts at 1 for the leftmost opening parenthesis, and each following opening parenthesis (and so the group it opens) gets the next number in increasing order.
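Numbering by opening parenthesis matters most when groups are nested. The sketch below shows it with the JDK's java.util.regex as a stand-in for jregex; the pattern, input and class name are made up for the example. Note that the JDK's Matcher.groupCount() does not count group 0 (the whole match); no claim is made here about how jregex counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch using the JDK's java.util.regex, not jregex.
class GroupNumberingDemo {
    // Returns the contents of groups 1..n, or an empty list if the input does not match.
    static List<String> groups(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Opening parentheses, left to right: 1 = outer range, 2 = first number,
        // 3 = second number, 4 = word after "@".
        System.out.println(groups("((\\d+)-(\\d+))@(\\w+)", "10-20@host"));
        // [10-20, 10, 20, host]
    }
}
```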

Now suppose we have created a pattern with groups (myPattern), obtained a Matcher object (myMatcher), and succeeded with any of the find(), matches(), matchesPrefix(), or proceed() methods (see searching, matching, incomplete matching, and non-breaking search, respectively). Now we can:

find out the number of capturing groups:

   int gc=myPattern.groupCount();
   System.out.println("Group count: "+gc);

test whether some group is captured;

find out where a group starts and ends, and how long it is;

retrieve its contents:

   for(int i=0;i<gc;i++){
      System.out.println("Group #"+i+":");
      
      // skip groups that did not take part in the match
      if(!myMatcher.isCaptured(i)){
         System.out.println("  Not captured, taking next..");
         continue;
      }
      
      System.out.println("  starts at "+myMatcher.start(i));
      System.out.println("  ends at "+myMatcher.end(i));
      System.out.println("  length: "+myMatcher.length(i));
      System.out.println("  contents: \""+myMatcher.group(i)+"\"");
   }

Note that all methods for retrieving information about a match are grouped in the MatchResult interface, which is implemented by the Matcher class.

Example
Suppose we have a string myString (which actually is "The time is 15:20:45"), and we suspect that it contains a time in "hh:mm:ss" format. If so, we want to know the minute. Let's begin:

Pattern hms=new Pattern("\\b(\\d\\d):(\\d\\d):(\\d\\d)\\b"); // "\\b" tags a word boundary
Matcher m=hms.matcher(myString);
if(m.find()){
   System.out.println("Found!");
   String grp2=m.group(2);
   int minute=Integer.parseInt(grp2);
   System.out.println("The minute is "+minute);
   //prints "The minute is 20"
}
else{
   System.out.println("Not found :((");
}

Appendix C: Working with Unicode

Unicode characters

The jregex library supports Unicode both in patterns and in targets. For example, the pattern "\u0430" compiles into a pattern consisting of a single character (a Cyrillic 'a') in its literal meaning, which matches the corresponding character '\u0430' in a string.

Case insensitivity is also handled: if a pattern is compiled with REFlags.IGNORE_CASE ("i"), all characters match their upper, lower and title case variants.

Example: a pattern "[a-c\u0430-\u0432]" compiled with REFlags.IGNORE_CASE will match the following letters:

'a','b','c',
'A','B','C',
'\u0430','\u0431','\u0432',
'\u0410','\u0411','\u0412'

Unicode classes

The library has a number of predefined character classes dealing with Unicode blocks and categories. These classes have the following syntax:

\p{Name} - a positive class - characters belonging to specified block or category;

\P{Name} - a negative class - characters not belonging to specified block or category;

The Name is one of the following:

Unicode categories: Cn, Cc, Cf, Co, Cs, Lu, Ll, Lt, etc...;

Unicode blocks: isBasicLatin, isLatin-1Supplement, isLatinExtended-A, etc...

POSIX classes: Lower, Upper, ASCII, Alpha, Digit, Alnum, Punct, Graph, Print, Blank, Cntrl, XDigit, Space.

Note 1. The first letter of a category name represents the whole family; for example, \p{L} represents all of Lu, Ll, Lt, Lm, Lo.
Note 2. To print a list of the currently supported names, launch the jregex.CharacterClass class as a Java application:

java -cp .;[path to jregex.jar] jregex.CharacterClass >names.txt
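The positive and negative category classes, and the family shorthand from Note 1, can be tried with the JDK's java.util.regex, which accepts the same \p{L}, \p{Lu} and \P{L} syntax for Unicode categories. This is a stand-in sketch, not jregex; block names are deliberately left out because the JDK spells them differently. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Stand-in sketch using the JDK's java.util.regex, not jregex.
class UnicodeCategoryDemo {
    // \p{L}: any letter (Lu, Ll, Lt, Lm, Lo).
    static boolean isLetter(String s) {
        return Pattern.matches("\\p{L}", s);
    }

    // \p{Lu}: uppercase letters only.
    static boolean isUppercaseLetter(String s) {
        return Pattern.matches("\\p{Lu}", s);
    }

    // \P{L}: anything that is not a letter.
    static boolean isNotLetter(String s) {
        return Pattern.matches("\\P{L}", s);
    }

    public static void main(String[] args) {
        System.out.println(isLetter("\u0430"));          // true  (Cyrillic small a)
        System.out.println(isUppercaseLetter("\u0410")); // true  (Cyrillic capital A)
        System.out.println(isUppercaseLetter("\u0430")); // false
        System.out.println(isNotLetter("7"));            // true
    }
}
```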

REFlags.UNICODE/"u" flag

Turning on this flag in a pattern constructor forces the compiler to treat the corresponding Perl classes (\d, \D, \w, \W, \s, \S, etc.) as Unicode-aware. That is, \d becomes the same as \p{N}, and so on.
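The effect of such a flag is easiest to see on digits outside ASCII. The JDK's java.util.regex has a comparable switch, Pattern.UNICODE_CHARACTER_CLASS, which is used below as a stand-in; the exact mapping of \d in the JDK may differ from the jregex mapping described above, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Stand-in sketch using the JDK's java.util.regex, not jregex's REFlags.UNICODE.
class UnicodeDigitDemo {
    // Arabic-Indic digits one, two, three.
    static final String ARABIC_INDIC = "\u0661\u0662\u0663";

    // Default behaviour: \d covers only the ASCII digits 0-9.
    static boolean defaultDigits(String s) {
        return Pattern.compile("\\d+").matcher(s).matches();
    }

    // With the Unicode flag, \d also covers digits from other scripts.
    static boolean unicodeDigits(String s) {
        return Pattern.compile("\\d+", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(defaultDigits("123"));        // true
        System.out.println(defaultDigits(ARABIC_INDIC)); // false
        System.out.println(unicodeDigits(ARABIC_INDIC)); // true
    }
}
```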

Appendix D: Backslash issues

A backslash in regular expressions is used for switching between the literal and special meanings of some characters.

First, it turns metacharacters (i.e. special characters) into the corresponding plain characters. For example, a+ means "one or more repetitions of a", while a\+ stands for "a followed by a plus sign".

Second, some plain characters take on a special meaning when preceded by a backslash. For example, the regex d matches the character 'd', while \d matches a digit.

The concern is that the backslash is also a special character in the Java language, so its use in regular expressions interferes with its use in Java literals. Put simply, each backslash in a regex must be escaped (i.e. preceded by an extra backslash) to produce a Java literal:

Regex	Corresponding Java literal	Matches
d	"d"	A character 'd'
\d	"\\d"	A digit
\\	"\\\\"	A backslash
\\d	"\\\\d"	A backslash followed by a character 'd'
\\\d	"\\\\\\d"	A backslash followed by a digit
\	"\\"	Error: invalid pattern
n/a	"\"	Error: invalid Java expression
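The rows of the table can be checked mechanically. The sketch below does so with the JDK's java.util.regex as a stand-in for jregex (the Java string-literal rules are the same either way); the class and method names are hypothetical. The last table row cannot appear in the code at all, because "\" on its own does not compile as a Java literal.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Stand-in sketch using the JDK's java.util.regex, not jregex.
class BackslashDemo {
    static boolean matches(String regexLiteral, String input) {
        return Pattern.matches(regexLiteral, input);
    }

    // True if the regex held by the literal is a valid pattern.
    static boolean compiles(String regexLiteral) {
        try {
            Pattern.compile(regexLiteral);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("\\d".length());             // 2: the regex \d is two characters
        System.out.println(matches("d", "d"));          // true: a character 'd'
        System.out.println(matches("\\d", "7"));        // true: a digit
        System.out.println(matches("\\\\", "\\"));      // true: a single backslash
        System.out.println(matches("\\\\d", "\\d"));    // true: backslash followed by 'd'
        System.out.println(matches("\\\\\\d", "\\7"));  // true: backslash followed by a digit
        System.out.println(compiles("\\"));             // false: a lone backslash is not a valid regex
    }
}
```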
