Home | Documentation | Foundry | Examples | Demo | Download | Feedback | Books
Getting started with jregex
  • Pattern matching
  • Pattern searching
  • Replacing
  • Tokenizing
  • Filesystem utilities
  • Appendix A: Compilation flags
  • Appendix B: Using groups
  • Appendix C: Unicode
  • Appendix D: Backslash issues

    Pattern matching

    Pattern matching here means testing whether an entire string matches a pattern.

    1. Create a Pattern instance:
       Pattern p=new Pattern("\\w+"); //a word pattern
                      
    2. Obtain a Matcher for a string:
       Matcher m=p.matcher(myText);
                      
    3. Test:
       if(m.matches()){
          System.out.println("The string is exactly a single word!");
       }
       else{
          System.out.println("The string is not a single word!");
       }
                      
    After a successful match you can retrieve groups (if the pattern contains any) as described in
    Appendix B: Using groups.

    Pattern searching

    Pattern searching here means searching for non-overlapping occurrences of a pattern in a string. This is done with the find() method of the
    Matcher class. On the first call, find() starts searching at position zero of the target and returns true if an occurrence is found. On each following call it resumes searching just after the end of the previous match, which makes it possible to find all non-overlapping occurrences of a pattern.

    1. Create a Pattern instance:
       Pattern p=new Pattern("\\w+"); //a word pattern
                      
    2. Obtain a Matcher for the string:
       Matcher m=p.matcher(myText);
                      
    3. Search:
       while(m.find()){
          System.out.println("next word: ["+m.toString()+"]");
       }
                      
    3.1. The same as step 3, using the MatchIterator paradigm:
       MatchIterator mi=m.findAll();
       while(mi.hasMore()){
          MatchResult mr=mi.nextMatch();
          System.out.println("a word found: "+mr.toString());
       }
                      
    After a successful search you can retrieve groups (if the pattern contains any) as described in Appendix B: Using groups.

    Replacing

    The Replacer class replaces all occurrences of a Pattern within a string with one of the following:
  • a plain string
  • a string with group references
  • dynamically generated content (using the Substitution interface)

    Group references can be either numeric ("$N" or "${N}", where N is a
    group number) or symbolic ("${W}", where W is a group name).

    The simplest usage looks like this:
    Pattern p=new Pattern("(\\d\\d):(\\d\\d):(\\d\\d)");
    Replacer r=p.replacer("[hour=$1, minute=$2, second=$3]");
    //see also the constructor Replacer(Pattern,String,boolean)
    String result=r.replace("the time is 10:30:01");
    //gives "the time is [hour=10, minute=30, second=01]"
                      
    You can also append the result to a StringBuffer:
    StringBuffer sb=...;
    Replacer r=...;
    r.replace("the input string",sb);
    //now sb contains the result of replacement
                      
    or to a Writer:
    Writer out=...;
    Replacer r=...;
    r.replace("the input string",out);
                      
    or to any TextBuffer instance:
    Writer out=...;
    TextBuffer tb=Replacer.wrap(out);
    //see Replacer.wrap(Writer) and Replacer.wrap(StringBuffer)
    Replacer r=...;
    r.replace("the input string",tb);
                      
    Note that in the Replacer class there are a lot of similar methods for various input types.

    If a task requires more flexibility than Perl-like substitution expressions can provide, you can use a custom implementation of the Substitution interface. For example:
    Pattern p=new Pattern("(\\d+)\\+(\\d+)"); 
    Substitution add=new Substitution(){
       public void appendSubstitution(MatchResult match,TextBuffer dest){
          int a=Integer.parseInt(match.group(1));
          int b=Integer.parseInt(match.group(2));
          dest.append(String.valueOf(a+b));
       }
    };
    Replacer r=p.replacer(add);
    String result=r.replace("1+2 3+4");
    //"3 7"
                      

    String tokenizing

    Tokenizing a string with jregex's
    RETokenizer class is much like using the standard StringTokenizer class. The only difference is that RETokenizer uses occurrences of a pattern as token delimiters:
       String theText=" Some --- strings --- separated by \"---\"";
       Pattern p=new Pattern("(?<!\")---(?!\")"); //three hyphens not enclosed in quotemarks
       RETokenizer tok=new RETokenizer(p,theText);
       while(tok.hasMore())System.out.println("Next token: ["+tok.nextToken()+"]");
       //prints:
       // Next token: [ Some ]
       // Next token: [ strings ]
       // Next token: [ separated by "---"]
                      
    RETokenizer has a split() method that returns all the tokens as a String array:
       Pattern p=...;
       String[] arr=p.tokenizer("input string").split();
                      
    An important point is how RETokenizer handles several adjacent delimiters: it can treat them either as a single delimiter or as separate delimiters with empty tokens between them. You can control this behaviour using the RETokenizer.setEmptyEnabled(boolean) method.
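    The two ways of treating adjacent delimiters can be illustrated without jregex. The sketch below uses the JDK's java.util.regex package, not RETokenizer, so it only demonstrates the idea; the class name AdjacentDelimiters and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class AdjacentDelimiters {
    // Every delimiter occurrence separates two tokens,
    // so adjacent delimiters produce empty tokens between them.
    public static List<String> withEmptyTokens(String text, String delimiterRegex) {
        // A negative limit keeps trailing empty strings as well.
        return Arrays.asList(Pattern.compile(delimiterRegex).split(text, -1));
    }

    // A run of adjacent delimiters acts as a single delimiter:
    // the empty tokens are simply dropped.
    public static List<String> withoutEmptyTokens(String text, String delimiterRegex) {
        List<String> tokens = new ArrayList<>();
        for (String token : Pattern.compile(delimiterRegex).split(text, -1)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(withEmptyTokens("a,,b", ","));    // [a, , b]
        System.out.println(withoutEmptyTokens("a,,b", ",")); // [a, b]
    }
}
```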

    Filesystem utilities

    1. The jregex.util.io.WildcardFilter class.

    This implementation of the java.io.FilenameFilter interface filters files by name using the well-known wildcards "?" and "*", where the first matches any single character and the second matches any string. So "?or?" matches "wORd" and "mORe", and "*or*" matches both of these plus "transpORtation", "mirrOR", "ORbit", etc. Usage:
       File dir=...;
       String[] htmlFiles=dir.list(new WildcardFilter("*.html"));
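    To make the wildcard semantics concrete, here is a minimal sketch of how "?" and "*" could be translated into a regular expression. It uses the JDK's java.util.regex and is not how WildcardFilter is actually implemented; the class WildcardSketch and its methods are hypothetical.

```java
import java.util.regex.Pattern;

public class WildcardSketch {
    // Translates a wildcard expression into an equivalent java.util.regex pattern.
    public static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');                               // any single character
            } else if (c == '*') {
                regex.append(".*");                              // any string, possibly empty
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));  // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // True if the whole name matches the wildcard expression.
    public static boolean matches(String wildcard, String name) {
        return toRegex(wildcard).matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("?or?", "word"));           // true
        System.out.println(matches("?or?", "transportation")); // false
        System.out.println(matches("*or*", "transportation")); // true
        System.out.println(matches("*.html", "index.html"));   // true
    }
}
```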
                      
    2. The jregex.util.io.PathPattern class.

    This class has two possible applications:
  • to search files by path patterns
  • to match the system-dependent path strings against the system-independent patterns

    2.1. File search (the key method: PathPattern.enumerateFiles())

    The path pattern can be either relative or absolute and may contain the following wildcards:
  • ? - any-character
  • * - any-string
  • ** - any-path

    Some examples:
  • /** - all files and directories under the root;
  • /**/ - all directories under the root;
  • /**/tmp/**/*.java - all .java files under the root that include a 'tmp' directory somewhere in the path;
  • ** - all files under the current directory, the same as **/*
  • **/*.java - all .java files under the current directory
  • */* - all files and directories that are one level below the current directory
  • *.j??? - all files in the current directory whose extension consists of 4 chars starting with 'j'

    Usage:
       PathPattern pp=new PathPattern("/tmp/**/*.java");
       Enumeration e=pp.enumerateFiles();
       while(e.hasMoreElements()){
          File f=(File)e.nextElement();
          f.delete();
       }
                      
    2.2. Path string matching (the key methods: Pattern.matches(String), Pattern.matcher(), etc)

    As a descendant of jregex.Pattern, PathPattern inherits all its functionality, so it can be used to search and match path strings.
    For example, the pattern */*.java would match the strings foo/Bar.java and bar\Foo.java (on Windows), and wouldn't match FooBar, FooBar.java or foo/bar/FooBar.java.
    Note that each wildcard becomes a capturing group in the pattern.
    Usage:
       String myPath=...;
       Pattern p=new PathPattern("**/*"); //the "**" is the 1-st group, the "*" is the second
       Matcher m=p.matcher(myPath);
       if(m.matches()){
          System.out.println("directory: "+m.group(1));
          System.out.println("file name: "+m.group(2));
       }
                      

    Appendix A: Compilation flags

    Compilation flags change the meaning of some syntax elements. They may be passed to the Pattern constructor either as a string containing the appropriate characters, or as a bitwise OR of int constants. The flags are:
    int form              | string form | if enabled | Default state
    REFlags.IGNORE_CASE   | "i"         | Forces a matcher to ignore case | Disabled
    REFlags.MULTILINE     | "m"         | Forces a '^' tag to match BOLs and a '$' to match EOLs | Disabled
    REFlags.DOTALL        | "s"         | Forces a '.' (dot) tag to match line separator chars | Disabled
    REFlags.IGNORE_SPACES | "x"         | Forces a compiler to ignore spaces in an expression, so that a pattern can be spaced out for better readability | Disabled
    REFlags.UNICODE       | "u"         | Forces a compiler to treat \w, \d, \s, etc. as relating to Unicode | Disabled
    REFlags.XML_SCHEMA    | "X"         | Enables compatibility with XML schema regular expressions | Disabled

    A flag string looks like "imsxuX-imsxuX", where characters before the hyphen enable the corresponding flags and characters after the hyphen disable them. You can also embed such a string into a pattern using the "(?imsxuX-imsxuX)" and "(?imsxuX-imsxuX:)" constructs. The first sets flags for the rest of the pattern, while the second sets flags only for the enclosed part (between the colon and the closing parenthesis).
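    The difference between the two embedded constructs can be sketched with the JDK's java.util.regex engine, which accepts the same "(?i)" and "(?i:...)" forms for the "i" flag. This is not jregex code and says nothing about the jregex-specific "u" and "X" flags; the class EmbeddedFlagsSketch and its helper are hypothetical.

```java
import java.util.regex.Pattern;

public class EmbeddedFlagsSketch {
    // True if the whole input matches the regex (JDK engine, not jregex).
    public static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // "(?i)" applies to the rest of the pattern.
        System.out.println(matches("(?i)abc", "ABC"));       // true
        // "(?i:...)" applies only to the enclosed part.
        System.out.println(matches("a(?i:b)c", "aBc"));      // true
        System.out.println(matches("a(?i:b)c", "aBC"));      // false
        // A flag after the hyphen is switched off again.
        System.out.println(matches("(?i)a(?-i)bc", "Abc"));  // true
        System.out.println(matches("(?i)a(?-i)bc", "ABC"));  // false
    }
}
```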

    Appendix B: Using groups

    A group is a special construct in regular expressions that looks like a part of a pattern enclosed in parentheses. Also, two conditions must be met: parentheses should not be escaped and the opening one should not be followed by a non-escaped question mark ("(abc)" is a valid group for example). Groups can be nested.

    Groups have two functions.
  • First, like parentheses in arithmetic, a group lets you treat its contents as a single item.
  • Second, upon a successful match the group "captures" the corresponding part of the input, so that this part can be accessed independently.

    This is perhaps the most attractive feature of regular expressions: not only can we describe a complex structure and find the matching substring in a text, but we can also immediately retrieve the values corresponding to the parts of this structure.

    For example, the pattern "(\d+)(\s+)(\d+)" has three groups: the first captures some digits, the second captures some spaces, and the third captures some digits again.

    To retrieve the contents of a group, we need a way to address it. For this purpose the groups in an expression are numbered automatically. The numbering starts from "1" for the leftmost opening parenthesis, and the following opening parentheses (and so the groups they open) get their numbers in increasing order.
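    The numbering rule is easiest to see with nested groups. The sketch below uses the JDK's java.util.regex, which numbers groups by opening parenthesis in the same left-to-right way; it is not jregex code, and the class GroupNumberingSketch and its helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberingSketch {
    // Returns the contents of the given group after a full match,
    // or null if the input does not match (JDK engine, not jregex).
    public static String groupOf(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Opening parentheses from left to right:
        //   1: ((\d+)-(\d+))   2: (\d+)   3: (\d+)   4: (\w+)
        String regex = "((\\d+)-(\\d+)) (\\w+)";
        String input = "10-20 items";
        for (int i = 1; i <= 4; i++) {
            System.out.println("Group #" + i + ": " + groupOf(regex, input, i));
        }
        // Group #1: 10-20
        // Group #2: 10
        // Group #3: 20
        // Group #4: items
    }
}
```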

    Now suppose we have created a pattern with groups (myPattern), obtained a Matcher object (myMatcher), and have succeeded with any of the
    find(), matches(), matchesPrefix() or proceed() methods (see searching, matching, incomplete matching and non-breaking search, respectively). Now we can:
  • find out a number of capturing groups:
       int gc=myPattern.groupCount();
       System.out.println("Group count: "+gc);
                      
  • test whether some group is captured;
  • find out where some group starts and ends, and how long it is;
  • retrieve its contents:
                      
       for(int i=0;i<gc;i++){
          System.out.println("Group #"+i+":");
          
          if(!myMatcher.isCaptured(i)){
             System.out.println("  Not captured, taking next..");
             continue;
          }
          
          System.out.println("  starts at "+myMatcher.start(i));
          System.out.println("  ends at "+myMatcher.end(i));
          System.out.println("  length: "+myMatcher.length(i));
          System.out.println("  contents: \""+myMatcher.group(i)+"\"");
       }
                      
    Note that all methods for retrieving information about a match are declared in the MatchResult interface, which is implemented by the Matcher class.

    Example
    Suppose we have a string myString (which actually is "The time is 15:20:45"), and we suspect it contains a time in "hh:mm:ss" format. If so, we want to know the minute. Let's begin:
    Pattern hms=new Pattern("\\b(\\d\\d):(\\d\\d):(\\d\\d)\\b"); // "\\b" tags a word boundary
    Matcher m=hms.matcher(myString);
    if(m.find()){
       System.out.println("Found!");
       String grp2=m.group(2);
       int minute=Integer.parseInt(grp2);
       System.out.println("The minute is "+minute);
       //prints "The minute is 20"
    }
    else{
       System.out.println("Not found :((");
    }
                      

    Appendix C: Working with Unicode

    Unicode characters

    The jregex library supports Unicode both in patterns and in targets. For example, the pattern "\u0430" compiles into a pattern consisting of a single character (a Cyrillic 'a') in its literal meaning, which will match the corresponding character '\u0430' in a string.

    Case insensitivity is also handled, so if a pattern is compiled with REFlags.IGNORE_CASE ("i"), all characters will match their upper, lower and title case variants.

    Example: a pattern "[a-c\u0430-\u0432]" compiled with REFlags.IGNORE_CASE will match the following letters:
    'a','b','c',
    'A','B','C',
    '\u0430','\u0431','\u0432',
    '\u0410','\u0411','\u0412'

    Unicode classes

    The library has a number of predefined character classes dealing with
    Unicode blocks and categories. These classes have the following syntax:
  • \p{Name} - a positive class - characters belonging to specified block or category;
  • \P{Name} - a negative class - characters not belonging to specified block or category;

    The Name is one of the following:
  • Unicode categories: Cn, Cc, Cf, Co, Cs, Lu, Ll, Lt, etc...;
  • Unicode blocks: isBasicLatin, isLatin-1Supplement, isLatinExtended-A, etc...
  • POSIX classes: Lower, Upper, ASCII, Alpha, Digit, Alnum, Punct, Graph, Print, Blank, Cntrl, XDigit, Space.

    Note 1. The first letters in category names represent the whole family, for example \p{L} represents all of Lu, Ll, Lt, Lm, Lo.
    Note 2. To print a list of currently supported names launch the jregex.CharacterClass class as a java application:
    java -cp .;[path to jregex.jar] jregex.CharacterClass >names.txt
                      

    REFlags.UNICODE/"u" flag

    Turning on this flag in a pattern constructor forces a compiler to treat appropriate perl classes (\d, \D, \w, \W, \s, \S, etc) as belonging to Unicode. That is, \d becomes the same as \p{N} and so on.

    Appendix D: Backslash issues

    Backslash in regular expressions is used for switching between the literal and special meanings of some characters.

    First, it turns metacharacters (i.e. special characters) into the corresponding plain characters. For example, a+ means "one or more repetitions of a", while a\+ stands for "a followed by a plus sign".

    Second, some plain characters take on a special meaning when preceded by a backslash. For example, the regex d matches the character 'd', while \d matches a digit.

    The concern is that the backslash is also a special character in the Java language, so its use in regular expressions interferes with its use in Java literals. Put simply, each backslash in a regex must be escaped (i.e. preceded by an extra backslash) to produce a Java literal:
    Regex | Corresponding Java literal | Matches
    d     | "d"                        | A character 'd'
    \d    | "\\d"                      | A digit
    \\    | "\\\\"                     | A backslash
    \\d   | "\\\\d"                    | A backslash followed by a character 'd'
    \\\d  | "\\\\\\d"                  | A backslash followed by a digit
    \     | "\\"                       | Error: invalid pattern
    n/a   | "\"                        | Error: invalid Java expression
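    The rows of this table can be checked with a short program. The sketch below uses the JDK's java.util.regex rather than jregex, on the assumption that both engines treat these particular escapes the same way; the class BackslashSketch and its helpers are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class BackslashSketch {
    // True if the whole input matches the regex written as a Java literal.
    public static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex compiles without a syntax error.
    public static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("d", "d"));           // true: a character 'd'
        System.out.println(matches("\\d", "7"));         // true: a digit
        System.out.println(matches("\\\\", "\\"));       // true: a single backslash
        System.out.println(matches("\\\\d", "\\d"));     // true: backslash followed by 'd'
        System.out.println(matches("\\\\\\d", "\\7"));   // true: backslash followed by a digit
        System.out.println(isValid("\\"));               // false: a lone backslash is not a pattern
    }
}
```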

    Copyright 2000-2002 S. A. Samokhodkin