Syntax reference

Introduction

The syntax used in the jregex library is a superset of the Perl 5.6 [*] regular expression syntax. The difference is named groups (and their counterparts, named backreferences), which are present in jregex but not in Perl.

What is a regular expression

Regular expressions, in a wide sense, are a small but powerful language for performing complex text manipulations and extracting data. In a narrow sense, a regular expression, or regex, is a string of symbols that represent either elements or operations on elements, where each element matches some set of strings.

Basic operations

Let E1 be an element matching string S1, E2 be an element matching string S2, and REGEX be an element composition.
Then the most basic operations on them are:
Operation                              Syntax      Meaning
Identity operation on E1               E1          matches S1
Concatenation of elements E1 and E2    E1E2        matches S1S2
Alternation of elements E1 and E2      E1|E2       matches either S1 or S2
Kleene closure of element E            E*          zero or more concatenations of E
Positive closure of element E          E+          one or more concatenations of E
Grouping                               (?:REGEX)   allows the REGEX to be treated as a single element
                                       (REGEX)     in addition to the above, stores the corresponding part of a match in memory
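
The basic operations can be tried out in any Perl-style engine. The sketch below uses Python's `re` module as a stand-in (an assumption: it is not jregex, but it shares this part of the syntax). The elements "ab" and "cd" and all variable names are illustrative only.

```python
import re

# E1 = "ab", E2 = "cd": each literal element matches exactly itself.
concatenation = re.fullmatch(r"abcd", "abcd") is not None   # E1E2 matches S1S2

# Alternation accepts either S1 or S2, but not their concatenation.
alternation = [bool(re.fullmatch(r"ab|cd", s)) for s in ("ab", "cd", "abcd")]

# Kleene closure accepts zero repetitions; positive closure needs at least one.
kleene = [bool(re.fullmatch(r"(?:ab)*", s)) for s in ("", "ab", "abab")]
positive = [bool(re.fullmatch(r"(?:ab)+", s)) for s in ("", "ab", "abab")]

# Without grouping, * applies only to the last element: "ab*" is "a" then "b*".
ungrouped = re.fullmatch(r"ab*", "abab") is not None

# A capturing group also stores the part of the match it covered.
captured = re.fullmatch(r"(ab|cd)x", "cdx").group(1)
```

Here `alternation` is `[True, True, False]`, `kleene` is `[True, True, True]`, `positive` is `[False, True, True]`, `ungrouped` is `False`, and `captured` is `"cd"`.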

Metacharacters

The following characters (called metacharacters) have special meaning in regular expressions, and therefore cannot be used literally:
` . * + ? { } [ ] ( ) | \ ^ $ '

Elements:

  • characters ( "a" , "A" , ... )
  • character classes ( "[a-z0-9]" , "[^a-z0-9]" , "\d" , "\D" , ... )
  • named classes ( "\p{name}" , "\P{name}" )
  • combined classes ( "(?C1+C2-C3&C4)" )
  • anchors ( "^" , "$" , "\b" , "\B" , "\A" , "\Z" , ... )
  • groups ( "(REGEX)" , "(?:REGEX)" )
  • named groups ( "({Name}REGEX)")
  • backreferences ( "\1" , "\2" , ... )
  • named backreferences ( "{\NAME}")
  • quantifiers ( "ELEMENT?" , "ELEMENT*" , "ELEMENT+" , "ELEMENT{n}" , "ELEMENT{m,n}" , ... )
  • assertions ( "(?=REGEX)" , "(?!REGEX)" , "(?<=REGEX)" , "(?<!REGEX)" )
  • comments ( "(?# it's a comment )" )
  • special expressions ( "(?(condition)yes-pattern|no-pattern)" , "(?>REGEX)" , "(?@C1C2)" , ... )

    Characters.

  • If a character 'C' is not a metacharacter, then element C matches exactly the string "C" (i.e. the literal meaning of itself).
  • If a character 'C' is a metacharacter, then element \C matches "C" (i.e. the literal meaning of 'C').
  • Non-printable characters:
    \e \f \n \r \t
    match ESC, FF, LF, CR and TAB, respectively.
    \cC , where C is a character
    matches control-C
    \xHH or \x{H..H}
    matches the character whose Unicode code is the hexadecimal number HH or H..H
    Example:
    regex "abc" matches the string "abc" ;
    regex "\(\.\)" matches the string "(.)" ;
    regex "\x31\x{32}\x{0033}" matches the string "123" .
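
A minimal sketch of literal and escaped characters, using Python's `re` module as a stand-in for jregex (an assumption; the rules shown are common to Perl-style engines). Only the two-digit \xHH form is used, because `re` does not accept \x{H..H}, \e or \cC. The sample strings are made up for illustration.

```python
import re

# A metacharacter must be escaped with a backslash to match itself.
literal_parens = re.fullmatch(r"\(\.\)", "(.)") is not None

# Unescaped, "." is a wildcard; escaped, it matches only a dot.
wildcard = re.fullmatch(r"a.c", "abc") is not None
escaped_dot = re.fullmatch(r"a\.c", "abc") is not None

# re.escape() backslash-escapes the metacharacters of a literal string.
price = re.escape("$9.99 (approx.)")
found = re.search(price, "total: $9.99 (approx.)") is not None

# Hex escapes: \x31 \x32 \x33 are the characters "1", "2", "3".
hex_digits = re.fullmatch(r"\x31\x32\x33", "123") is not None

# Non-printable characters: \t is TAB, \n is LF.
tab_newline = re.fullmatch(r"a\tb\n", "a\tb\n") is not None
```

Every flag above is `True` except `escaped_dot`, which is `False` because "a\.c" requires a literal dot.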

    Character classes.

  • [R1R2R3...Rn] - a positive character class, matches any of R1,R2,R3,...,Rn,
  • [^R1R2R3...Rn] - a negative character class, matches a character that is not any of R1,R2,R3,...,Rn,
  • where Ri is one of the following:
    C
    - a character, including \e,\f,\n,\r,\t,\xHH,\x{HHHH}
    - a metacharacter; metacharacters inside [...] have their literal meaning
    C1-C2
    a character range
    \d \D \s \S \w \W
    embedded classes
    \p{name} \P{name}
    named classes
  • . - matches any character if the "s" flag is turned on; otherwise, matches any character except EOL characters;
  • \d - matches a digit, equivalent to [0-9];
  • \D - matches not a digit, equivalent to [^0-9];
  • \w - matches a word character, equivalent to [a-zA-Z_0-9];
  • \W - matches not a word character, equivalent to [^a-zA-Z_0-9];
  • \s - matches a whitespace, equivalent to [ \f\n\r\t];
  • \S - matches not a whitespace, equivalent to [^ \f\n\r\t];
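
The class syntax above can be exercised with Python's `re` module, used here as a stand-in for jregex (an assumption). Two Python-specific details: the flag `re.ASCII` restricts \d and \w to the ASCII sets listed above, and the "s" flag is spelled `re.DOTALL`. The sample text is invented.

```python
import re

text = "Order #42: 3 apples,\t7 pears\n"

# \d with re.ASCII finds the same runs as the explicit class [0-9].
digits = re.findall(r"\d+", text, re.ASCII)
digits_explicit = re.findall(r"[0-9]+", text)

# Positive and negative classes over the same set of characters.
vowels = re.findall(r"[aeiou]", "regex")
non_vowels = re.findall(r"[^aeiou]", "regex")

# A character range, and metacharacters used literally inside brackets.
first_letters = re.findall(r"[a-c]", "abcdef")
punctuation = re.findall(r"[.*+]", "a.b*c+d")

# \w+ picks out runs of word characters; \s+ splits on runs of whitespace.
words = re.findall(r"\w+", text, re.ASCII)
fields = re.split(r"\s+", "a \t b\nc")

# "." does not match EOL unless the "s" flag (re.DOTALL) is on.
dot_default = re.fullmatch(r"a.b", "a\nb") is not None
dot_with_s_flag = re.fullmatch(r"a.b", "a\nb", re.DOTALL) is not None
```

`digits` and `digits_explicit` are both `['42', '3', '7']`; `dot_default` is `False` and `dot_with_s_flag` is `True`.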

    Named character classes.

    \p{name}
    matches a character belonging to a specified Unicode block or category.
    \P{name}
    matches a character not belonging to a specified Unicode block or category.
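
Python's `re` module has no \p{name}, so the sketch below only illustrates what membership in a Unicode category means, using the standard `unicodedata` module. The helper `in_category` is hypothetical, and the sketch covers general categories such as "Lu" or "L", not Unicode blocks.

```python
import unicodedata

def in_category(ch: str, category: str) -> bool:
    """True if ch belongs to the Unicode general category, e.g. "Lu" or "Nd".

    A one-letter category such as "L" covers all its subcategories (Lu, Ll, ...).
    """
    return unicodedata.category(ch).startswith(category)

sample = "aB\u00c4\u03b41_"   # a, B, A-umlaut, Greek delta, 1, underscore

uppercase = [ch for ch in sample if in_category(ch, "Lu")]        # like \p{Lu}
not_letters = [ch for ch in sample if not in_category(ch, "L")]   # like \P{L}
```

`uppercase` holds "B" and the A-umlaut; `not_letters` is `['1', '_']`.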

    Combined classes.

  • (?R1+R2-R3&R4) - a character class combination, always positive; matches any character that falls in R1 or R2, does not fall in R3, and falls in R4;
  • Ri - any of the character classes above; the brackets must be present: [\w], [\p{L}] rather than \w, \p{L};
  • + - set addition;
  • - - set subtraction;
  • & - set intersection.
  • grouping is not implemented.

    Example:
    regex "(?[a-zA-Z]-[\p{Lu}]+[X])" matches either any of a,b,c,...,z or X.
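
The (?...) combination syntax is specific to jregex. As a rough illustration of the set arithmetic in the example above, the sketch below emulates it in Python's `re` module (a different engine, so this is an assumption about equivalent behaviour, not jregex code): subtraction becomes a negative lookahead and addition becomes alternation. [A-Z] stands in for [\p{Lu}], which selects the same characters within [a-zA-Z].

```python
import re
import string

# ([a-zA-Z] minus uppercase letters) plus "X", read left to right.
combined = re.compile(r"(?:(?![A-Z])[a-zA-Z]|X)")

matches = combined.findall("aXbY9z")

# The same combination written with ordinary Python sets.
allowed = (set(string.ascii_letters) - set(string.ascii_uppercase)) | {"X"}

# Both formulations agree on every printable ASCII character.
agree = all(
    (combined.fullmatch(ch) is not None) == (ch in allowed)
    for ch in string.printable
)
```

`matches` is `['a', 'X', 'b', 'z']`. An intersection (&) could be emulated in the same style with a positive lookahead.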

    Anchors.

  • \A - matches the beginning of the text;
  • \G - matches the end of the last match or the beginning of the text;
  • \Z - matches the end of the text, or before an EOL at the end of the text;
  • \z - matches the end of the text
  • ^ - matches the beginning of the text; if compilation flag 'm' is set, it also matches after an EOL.
  • $ - matches the same as \Z; if compilation flag 'm' is set, it also matches before an EOL;
  • \b - word boundary;
  • \< - matches the beginning of a word;
  • \> - matches the end of a word;
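
A short sketch of ^, $, \A and \b, using Python's `re` module as a stand-in for jregex (an assumption). The 'm' flag is `re.M` in Python. \Z is left out because Python's \Z matches only at the very end of the text, like \z above; \G, \< and \> are left out because `re` does not have them.

```python
import re

text = "one two\nthree four\n"

# ^ matches only at the beginning of the text unless the 'm' flag is set.
first_default = re.findall(r"^\w+", text)
first_multiline = re.findall(r"^\w+", text, re.M)

# $ matches at the end of the text or before a final EOL;
# with 'm' it also matches before every EOL.
last_default = re.findall(r"\w+$", text)
last_multiline = re.findall(r"\w+$", text, re.M)

# \A is not affected by the 'm' flag.
start_only = re.findall(r"\A\w+", text, re.M)

# \b requires a word boundary on both sides of "cat".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")
```

`first_default` is `['one']` and `first_multiline` is `['one', 'three']`; `last_default` is `['four']` and `last_multiline` is `['two', 'four']`; `whole_words` contains two matches.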

    Groups.

  • (?:REGEX) - non-capturing group, where REGEX is a regular expression. It allows a REGEX to be treated as a single element.
  • (REGEX) - capturing group, where REGEX is a regular expression. It serves the same function as a non-capturing group and, in addition, captures the corresponding part of a match. Groups are numbered from 1 upward, in the order in which their opening parentheses appear. The contents of a group can be used in two ways:
    1. it can be retrieved from a matcher upon a successful search;
    2. it can be backreferenced inside a pattern.
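
Group numbering and retrieval can be sketched with Python's `re` module, a stand-in for the jregex matcher (an assumption: the match-object API shown is Python's, not jregex's). The date string is invented.

```python
import re

# Two capturing groups and one non-capturing group.
m = re.search(r"(\d{4})-(\d{2})-(?:\d{2})", "released 2002-03-15")
year, month = m.group(1), m.group(2)
whole = m.group(0)             # group 0 is the entire match
group_count = m.re.groups      # non-capturing groups are not counted

# Numbering follows the opening parentheses, so the outer group is number 1.
nested = re.fullmatch(r"((a)(b))", "ab")
nested_groups = nested.groups()
```

`year` is `"2002"`, `month` is `"03"`, `group_count` is `2`, and `nested_groups` is `('ab', 'a', 'b')`.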

    Named groups.

  • ({NAME}REGEX) - capturing group, where REGEX is a regular expression and NAME is either a word or a decimal number. If NAME is a word (not a number), the group is assigned a symbolic name. Otherwise, if NAME is a decimal number, the group is assigned the corresponding numeric id.
    The contents of a group can be used in two ways:
    1. it can be retrieved from a matcher upon a successful search;
    2. it can be backreferenced using a named backreference inside a pattern.

    Backreferences.

  • \1 \2 \3 \4 \5 \6 \7 \8 \9 - backreferences to the 1st, 2nd, ... , 9th group. A backreference to the N-th group ("\N") matches the substring currently captured by that group. If the group has captured nothing, the backreference fails.
    Example:
    regex "(\w)\1" matches doubled word characters (namely "aa" and "cc") in the string "123aabccd";
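
The example above, reproduced in Python's `re` module as a stand-in for jregex (an assumption; numbered backreferences use the same syntax in both). The tag strings are made up.

```python
import re

# Doubled word characters in the string from the example.
doubled = [m.group(0) for m in re.finditer(r"(\w)\1", "123aabccd")]

# A backreference matches the captured text itself, not the group's pattern.
same_tag = re.fullmatch(r"<(\w+)>.*</\1>", "<b>bold</b>") is not None
other_tag = re.fullmatch(r"<(\w+)>.*</\1>", "<b>bold</i>") is not None

# If the group captured nothing, the backreference fails.
uncaptured = re.fullmatch(r"(a)?b\1", "b")
```

`doubled` is `['aa', 'cc']`, `same_tag` is `True`, `other_tag` is `False`, and `uncaptured` is `None`.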

    Named backreferences.

  • {\MyGroup} - backreference to a named group "MyGroup".
  • {\N} - backreference to an N-th group, the same as a simple \N
    Apart from the syntax, named backreferences behave the same way as ordinary backreferences.
    Example:
    regexes "({Letter}\w){\Letter}" and "({1}\w){\1}" match doubled word characters (namely "aa" and "cc") in the string "123aabccd";
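
Named groups exist in other engines under a different spelling. The sketch below uses Python's `re` module, where the group is written (?P<Letter>...) and the backreference (?P=Letter); this is Python syntax and an assumption about equivalent behaviour, not the jregex syntax described above. The "mode=fast" sample is invented.

```python
import re

# jregex: ({Letter}\w){\Letter}     Python re: (?P<Letter>\w)(?P=Letter)
doubled = [
    m.group(0)
    for m in re.finditer(r"(?P<Letter>\w)(?P=Letter)", "123aabccd")
]

# A named group can be retrieved by name or by its number.
m = re.search(r"(?P<key>\w+)=(?P<value>\w+)", "mode=fast")
by_name = (m.group("key"), m.group("value"))
by_number = (m.group(1), m.group(2))
as_dict = m.groupdict()
```

`doubled` is `['aa', 'cc']`, and `by_name` equals `by_number`.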

    Quantifiers.
    Let E be an element, then:

  • E* - matches a series of 0 or more of element E, first trying to match as many repetitions as possible (the 'greedy' behaviour);
  • E+ - matches a series of 1 or more of element E, first trying to match as many repetitions as possible;
  • E? - matches 1 or 0 of E, first trying 1;
  • E{n} - matches a series of exactly n of element E;
  • E{m,n} - matches a series of m to n of element E, first trying to match as many repetitions as possible;
  • E{,n} - matches a series of 0 to n of element E, first trying to match as many repetitions as possible;
  • E{m,} - matches a series of m or more of element E, first trying to match as many repetitions as possible;
  • E*? , E+? , E?? , E{n}? , E{m,n}? , E{,n}? , E{m,}? - the '?' after a quantifier changes its behaviour: it first tries to match as few repetitions as possible;
    Note that if a pattern allows E to be zero-width (such as "\b" or "(?:abc|)"), then any of E* , E+ , E{m,} may cause an infinite loop.
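
Greedy and non-greedy behaviour is easiest to see side by side. A minimal sketch in Python's `re` module, used as a stand-in for jregex (an assumption; the quantifier syntax shown is accepted by both). The sample strings are invented.

```python
import re

text = "<a><b>"

# Greedy: as many repetitions as possible. Non-greedy: as few as possible.
greedy = re.search(r"<.+>", text).group(0)
lazy = re.search(r"<.+?>", text).group(0)

# E? tries one occurrence first; E?? tries zero first.
optional_greedy = re.match(r"a?", "a").group(0)
optional_lazy = re.match(r"a??", "a").group(0)

# Counted repetitions.
exact = re.findall(r"\d{3}", "1234567")
bounded = re.match(r"a{2,3}", "aaaa").group(0)
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group(0)
up_to = re.match(r"a{,2}", "aaaa").group(0)
at_least = re.match(r"a{2,}", "aaaa").group(0)
```

`greedy` is `"<a><b>"` while `lazy` is `"<a>"`; `bounded` is `"aaa"` while `bounded_lazy` is `"aa"`.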

    Assertions.

  • (?=REGEX) - positive lookahead assertion. It succeeds if the REGEX matches at the current location, yet the text it matches is not included in the result. Assertions may therefore be regarded as zero-width elements.
  • (?!REGEX) - negative lookahead assertion. This succeeds if the REGEX fails to match at the current location.
  • (?<=REGEX) - positive lookbehind assertion. This succeeds if the text that precedes the current location matches the REGEX. The REGEX must be fixed-width.
  • (?<!REGEX) - negative lookbehind assertion. This succeeds if the text that precedes the current location does not match the REGEX. The REGEX must be fixed-width.
    Example:
    regex "(?=.*\?)\b\w+\b" matches "is" in the string "is anybody here?" and fails to match "is anybody here.";
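
The four assertions, sketched in Python's `re` module as a stand-in for jregex (an assumption; the assertion syntax is the same). The first pattern is the example above; the other sample strings are invented.

```python
import re

# Positive lookahead: a "?" must follow somewhere, but it is not consumed.
lookahead = re.compile(r"(?=.*\?)\b\w+\b")
question = lookahead.search("is anybody here?")
statement = lookahead.search("is anybody here.")

# Negative lookahead: "foo" not followed by "bar".
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# Lookbehind: digits preceded, or not preceded, by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", "cost $15 or 20")
no_dollar = re.findall(r"(?<!\$)\b\d+", "cost $15 or 20")
```

`question.group(0)` is `"is"` and `statement` is `None`; `not_followed` is `[7]`; `after_dollar` is `['15']` and `no_dollar` is `['20']`.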

    Comments.

  • (?# comment )
    Example:
    regex "ab(?# my comment )cd" matches "abcd"
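
A comment has no effect on what the pattern matches. A quick check in Python's `re` module, which accepts the same (?# ... ) syntax (used here as a stand-in for jregex, an assumption):

```python
import re

with_comment = re.fullmatch(r"ab(?# my comment )cd", "abcd") is not None

# The comment text, including its spaces, is not part of the pattern.
with_space = re.fullmatch(r"ab(?# my comment )cd", "ab cd") is not None
```

`with_comment` is `True` and `with_space` is `False`.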

    Special expressions.

  • (?(condition)REGEX1|REGEX2) - conditional expression. If the condition is met, the element works as REGEX1, and otherwise as REGEX2. The condition is one of the following:
    a decimal number N
    the condition is met if the N-th group has captured;
    an assertion
    the condition is met if the assertion succeeds;
  • (?>REGEX) - independent expression. It is used to avoid unnecessary backtracking.
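
A sketch of both constructs in Python's `re` module, used as a stand-in for jregex (an assumption). Python supports only the group-number form of the condition, so the assertion form is not shown, and (?>...) requires Python 3.11 or newer, hence the version guard. The patterns are illustrative.

```python
import re
import sys

# Conditional: a closing ")" is required only if group 1 captured "(".
balanced = re.compile(r"^(\()?\d+(?(1)\))$")
results = {
    s: balanced.match(s) is not None
    for s in ("12", "(12)", "(12", "12)")
}

# Independent group: once a+ has matched, it never gives characters back.
if sys.version_info >= (3, 11):
    backtracking = re.fullmatch(r"a+ab", "aaab") is not None
    independent = re.fullmatch(r"(?>a+)ab", "aaab") is not None
```

`results` accepts "12" and "(12)" and rejects "(12" and "12)". On Python 3.11+, `backtracking` is `True` and `independent` is `False`, because the independent group keeps all three "a" characters and leaves none for the following "ab".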


    Links
    1. PerlRE manpage - the original source for the Perl 5.6 regex syntax.
    2. PerlRE tutorial
    3. Regular expression HOWTO - Python-oriented; from the very beginning to advanced concepts. Easy to read.
    4. Regex syntax in java.util.regex - since JDK 1.4



    References
    [*] Perl has its own flavour of regex syntax, which is widely adopted as standard.

    Copyright 2000-2002 S. A. Samokhodkin