
Syntax reference

Introduction

The syntax used in the jregex library is a superset of the Perl 5.6 [*] regular expression syntax. The difference is named groups (and their counterparts, named backreferences), which are present in jregex but not in Perl.

What is a regular expression

Regular expressions in a broad sense are a small, powerful language that can perform complex text manipulations and extract data. In a narrow sense, a regular expression, or regex, is a string consisting of symbols that represent either elements or operations on elements, where each element matches some set of strings.

Basic operations

Let E₁ be an element matching string S₁, E₂ be an element matching string S₂, and REGEX be an element composition.
Then the most basic operations on them are:

Operation	Syntax	Meaning
Identity operation on `E₁`	`E₁`	matches `S₁`
Concatenation of elements `E₁` and `E₂`	`E₁E₂`	matches `S₁S₂`
Alternation of elements `E₁` and `E₂`	`E₁|E₂`	matches either `S₁` or `S₂`
Kleene closure of element `E`	`E*`	zero or more concatenations of `E`
Positive closure of element `E`	`E+`	one or more concatenations of `E`
Grouping	`(?:REGEX)`	lets you treat `REGEX` as a single element
Grouping	`(REGEX)`	same as above, and additionally stores the corresponding part of a match in memory
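
The operations in the table can be tried out with a short sketch. It uses the standard java.util.regex package as a stand-in rather than jregex itself, on the assumption that these basic constructs behave the same way in both; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class BasicOperationsSketch {
    // Returns true when the regex matches the whole input string.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("ab", "ab"));       // concatenation
        System.out.println(matchesWhole("a|b", "b"));       // alternation
        System.out.println(matchesWhole("(?:ab)*", ""));    // Kleene closure accepts zero repetitions
        System.out.println(matchesWhole("(?:ab)+", ""));    // positive closure needs at least one
        System.out.println(matchesWhole("(?:ab)+", "abab")); // grouping makes "ab" a single element
    }
}
```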

Metacharacters

The following characters (called metacharacters) have special meaning in regular expressions, and therefore cannot be used literally:
. * + ? { } [ ] ( ) | \ ^ $

Elements:

characters ( "a" , "A" , ... )

character classes ( "[a-z0-9]" , "[^a-z0-9]" , "\d" , "\D" , ... )

named classes ( "\p{name}" , "\P{name}" )

combined classes ( "(?C₁+C₂-C₃&C₄)" )

anchors ( "^" , "$" , "\b" , "\B" , "\A" , "\Z" , ... )

groups ( "(REGEX)" , "(?:REGEX)" )

named groups ( "({Name}REGEX)")

backreferences ( "\1" , "\2" , ... )

named backreferences ( "{\NAME}")

quantifiers ( "ELEMENT?" , "ELEMENT*" , "ELEMENT+" , "ELEMENT{n}" , "ELEMENT{m,n}" , ... )

assertions ( "(?=REGEX)" , "(?!REGEX)" , "(?<=REGEX)" , "(?<!REGEX)" )

comments ( "(?# it's a comment )" )

special expressions ( "(?(condition)yes-pattern|no-pattern)" , "(?>REGEX)" , "(?@C₁C₂)" , ... )

Characters.

If a character 'C' is not a metacharacter, then element C matches exactly the string "C" (i.e. the literal meaning of itself).

If a character 'C' is a metacharacter, then element \C matches "C" (i.e. the literal meaning of 'C').

non-printable characters:

\e \f \n \r \t

match ESC, FF, LF, CR, TAB.

\cC , where C is a character

matches a control-C

\xHH or \x{H..H}

matches the character whose Unicode code is the hexadecimal number HH or H..H

Example:
regex "abc" matches string "abc" ;
regex "\(\.\)" matches string "(.)" ;
regex "\x31\x{32}\x{0033}" matches a string "123" .
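
A minimal sketch of literal characters, escaped metacharacters and hexadecimal escapes follows. It uses java.util.regex as a stand-in for jregex, assuming the two agree on these escapes; note that every backslash is doubled inside a Java string literal. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class CharacterEscapesSketch {
    // Returns true when the regex matches the whole input string.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Plain characters match themselves.
        System.out.println(matchesWhole("abc", "abc"));
        // Metacharacters need a backslash to be taken literally.
        System.out.println(matchesWhole("\\(\\.\\)", "(.)"));
        // \xHH and \x{H..H} name a character by its hexadecimal code.
        System.out.println(matchesWhole("\\x31\\x{32}\\x{0033}", "123"));
        // \t matches a TAB character.
        System.out.println(matchesWhole("a\\tb", "a\tb"));
    }
}
```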

Character classes.

[R₁R₂R₃...R_n] - a positive character class, matches any of R₁,R₂,R₃,...,R_n,

[^R₁R₂R₃...R_n] - a negative character class, matches a character that is not any of R₁,R₂,R₃,...,R_n,

where R_i is one of the following:

- a character, including \e,\f,\n,\r,\t,\xHH,\x{HHHH}

- a metacharacter; so metacharacters inside [...] have their literal meaning

- C₁-C₂ - a character range

- \d \D \s \S \w \W - embedded classes

- \p{name} \P{name} - named classes

. - matches any character if the "s" flag is turned on; otherwise, matches any character except EOL characters;

\d - matches a digit, equivalent to [0-9];

\D - matches not a digit, equivalent to [^0-9];

\w - matches a word character, equivalent to [a-zA-Z_0-9];

\W - matches not a word character, equivalent to [^a-zA-Z_0-9];

\s - matches a whitespace, equivalent to [ \f\n\r\t];

\S - matches not a whitespace, equivalent to [^ \f\n\r\t];
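
The classes above can be exercised with the following sketch. It relies on java.util.regex rather than jregex, assuming both treat these classes alike; Pattern.DOTALL is the java.util.regex counterpart of the "s" flag. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class CharacterClassesSketch {
    // Returns true when the regex, compiled with the given flags, matches the whole input.
    static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Positive class with ranges.
        System.out.println(matchesWhole("[a-z0-9]+", 0, "abc123"));
        // Negative class: "A" is neither a lowercase letter nor a digit.
        System.out.println(matchesWhole("[^a-z0-9]", 0, "A"));
        // \d matches a digit, \D matches anything else.
        System.out.println(matchesWhole("\\d\\D", 0, "7x"));
        // Without the "s" flag the dot does not match a line feed ...
        System.out.println(matchesWhole("a.b", 0, "a\nb"));
        // ... with it, the dot matches any character.
        System.out.println(matchesWhole("a.b", Pattern.DOTALL, "a\nb"));
    }
}
```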

Named character classes.

\p{name}

matches a character belonging to a specified Unicode block or category.

\P{name}

matches a character not belonging to a specified Unicode block or category.

Combined classes.

(?R₁+R₂-R₃&R₄) - an always-positive character class combination; matches everything that falls in R₁ or R₂, does not fall in R₃, and falls in R₄;

R_i - any of the character classes above; brackets must be present: [\w], [\p{L}] rather than \w, \p{L};

+ - set addition;

- - set subtraction;

& - set intersection.

Grouping within a combined class is not implemented.

Example:
regex "(?[a-zA-Z]-[\p{Lu}]+[X])" matches either any of a,b,c,...,z or X.

Anchors.

\A - matches the beginning of the text;

\G - matches the end of the last match or the beginning of the text;

\Z - matches the end of the text, or before an EOL at the end of the text;

\z - matches the end of the text

^ - matches the beginning of the text; if compilation flag 'm' is set, it also matches after an EOL.

$ - matches the same as \Z; if compilation flag 'm' is set, it also matches before an EOL;

\b - word boundary;

\< - matches the beginning of a word;

\> - matches the end of a word;
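
The effect of the 'm' flag on ^ and $ can be illustrated with the sketch below. It uses java.util.regex as a stand-in, where Pattern.MULTILINE plays the role of the 'm' flag; \< and \> are not available there and are left out. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorsSketch {
    // Collects every match of the regex (compiled with the given flags) in the text.
    static List<String> findAll(String regex, int flags, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\n";
        // ^ matches only at the beginning of the text: [one]
        System.out.println(findAll("^\\w+", 0, text));
        // With the 'm' flag, ^ also matches after an EOL: [one, two]
        System.out.println(findAll("^\\w+", Pattern.MULTILINE, text));
        // $ matches before the EOL that ends the text: [two]
        System.out.println(findAll("\\w+$", 0, text));
        // \z matches only at the very end, so nothing is found: []
        System.out.println(findAll("\\w+\\z", 0, text));
    }
}
```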

Groups.

(?:REGEX) - non-capturing group, REGEX is a regular expression. Allows REGEX to be treated as a single element.

(REGEX) - capturing group, REGEX is a regular expression. First, it has the same function as a plain group. Besides that, such a group captures the corresponding part of a match. Groups are numbered from 1 upward, in order of appearance of their opening parentheses. The contents of a group can be used in two ways:
1. it can be retrieved from a matcher upon a successful search;
2. it can be backreferenced inside a pattern.

Named groups.

({NAME}REGEX) - capturing group, REGEX is a regular expression, NAME is either a word or a decimal number. If NAME is a word (not a number), the group is assigned a symbolic name. Otherwise, if NAME is a decimal number, the group is assigned the corresponding numeric id.
The contents of a group can be used in two ways:
1. it can be retrieved from a matcher upon a successful search;
2. it can be backreferenced using a named backreference inside a pattern.

Backreferences.

\1 \2 \3 \4 \5 \6 \7 \8 \9 - backreferences to the 1st, 2nd, ..., 9th group. A backreference to the N-th group ("\N") matches the substring that is currently captured by this group. If the group has captured nothing, the backreference fails.
Example:
regex "(\w)\1" matches double word chars (namely "aa" and "cc") in a string "123aabccd";
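
The example above can be reproduced with a short sketch. It uses java.util.regex as a stand-in for jregex, assuming numbered backreferences work the same way in both; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackreferenceSketch {
    // Collects every match of the regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // (\w) captures one word character, \1 requires the same character again.
        System.out.println(findAll("(\\w)\\1", "123aabccd")); // prints [aa, cc]
    }
}
```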

Named backreferences.

{\MyGroup} - backreference to a named group "MyGroup".

{\N} - backreference to the N-th group, the same as a simple \N
Apart from the syntax, named backreferences behave the same way as ordinary backreferences.
Example:
regexes "({Letter}\w){\Letter}" and "({1}\w){\1}" match double word chars (namely "aa" and "cc") in a string "123aabccd";

Quantifiers
Let E be an element, then:

E* - matches a series of 0 or more of element E, first trying to match as many repetitions as possible (the 'greedy' behaviour);

E+ - matches a series of 1 or more of element E, first trying to match as many repetitions as possible;

E? - matches 1 or 0 of E, first trying 1;

E{n} - matches a series of exactly n of element E;

E{m,n} - matches a series of m to n of element E, first trying to match as many repetitions as possible;

E{,n} - matches a series of 0 to n of element E, first trying to match as many repetitions as possible;

E{m,} - matches a series of m or more of element E, first trying to match as many repetitions as possible;

E*? , E+? , E?? , E{n}? , E{m,n}? , E{,n}? , E{m,}? - the '?' after a quantifier changes its behaviour: it now first tries the smallest possible number of repetitions;
Note that if a pattern allows E to be zero-width (such as "\b" or "(?:abc|)"), then any of E* , E+ , E{m,} may cause an infinite loop.
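
The difference between greedy and non-greedy quantifiers can be seen in the sketch below. It uses java.util.regex as a stand-in for jregex, assuming the quantifiers shown behave identically; the E{,n} form is not accepted by java.util.regex and is therefore not shown. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierSketch {
    // Collects every match of the regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Greedy: .+ takes as much as it can, so one long match is found.
        System.out.println(findAll("<.+>", "<a><b>"));     // prints [<a><b>]
        // Non-greedy: .+? takes as little as it can, giving two short matches.
        System.out.println(findAll("<.+?>", "<a><b>"));    // prints [<a>, <b>]
        // E{m,n} prefers the upper bound, E{m,n}? prefers the lower bound.
        System.out.println(findAll("a{2,3}", "aaaaa"));    // prints [aaa, aa]
        System.out.println(findAll("a{2,3}?", "aaaaa"));   // prints [aa, aa]
    }
}
```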

Assertions.

(?=REGEX) - positive lookahead assertion. This requires that the REGEX matches at the current location, yet its corresponding match isn't included in the result. So assertions may be regarded as zero-width elements.

(?!REGEX) - negative lookahead assertion. This succeeds if the REGEX fails to match at the current location.

(?<=REGEX) - positive lookbehind assertion. This succeeds if the text that precedes the current location matches the REGEX. The REGEX must be fixed-width.

(?<!REGEX) - negative lookbehind assertion. This succeeds if the text that precedes the current location does not match the REGEX. The REGEX must be fixed-width.
Example:
regex "(?=.*\?)\b\w+\b" matches "is" in a string "is anybody here?" and fails to match "is anybody here.";
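
A sketch of the lookahead example, together with two lookbehind cases, is given below. It uses java.util.regex as a stand-in for jregex, assuming the assertions behave alike; the lookbehind patterns and sample strings are illustrative additions, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AssertionSketch {
    // Returns the first match of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Positive lookahead: a word is matched only if a '?' follows somewhere later.
        System.out.println(firstMatch("(?=.*\\?)\\b\\w+\\b", "is anybody here?")); // prints is
        System.out.println(firstMatch("(?=.*\\?)\\b\\w+\\b", "is anybody here.")); // prints null
        // Positive lookbehind: digits directly preceded by '$'.
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost: $42"));               // prints 42
        // Negative lookbehind: a number that is not directly preceded by '$'.
        System.out.println(firstMatch("(?<!\\$)\\b\\d+", "$42 and 17"));           // prints 17
    }
}
```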

Comments.

(?# comment )
Example:
regex "ab(?# my comment )cd" matches "abcd"

Special expressions.

(?(condition)REGEX1|REGEX2) - conditional expression. If the condition is met, the element works as REGEX1, and otherwise as REGEX2. The condition is one of the following:

a decimal number N

this condition is met if the N-th group has captured something;

an assertion

this condition is met if the assertion succeeds;

(?>REGEX) - independent expression. It is used to avoid unnecessary backtracking: once REGEX has matched, the matcher does not go back into it to try other ways of matching.
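
The effect of an independent expression can be sketched as follows, again using java.util.regex as a stand-in for jregex and assuming (?>REGEX) behaves the same in both; the sample patterns are illustrative, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class IndependentGroupSketch {
    // Returns true when the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Ordinary group: a+ gives one 'a' back so that the trailing "ab" can match.
        System.out.println(found("a+ab", "aaab"));      // prints true
        // Independent group: (?>a+) keeps every 'a' it took, so "ab" can never follow.
        System.out.println(found("(?>a+)ab", "aaab"));  // prints false
    }
}
```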

Links
1. PerlRE manpage - the original source for Perl 5.6 regex syntax.
2. PerlRE tutorial
3. Regular expression HOWTO - Python-oriented; from the very beginning to advanced concepts. Easy to read.
4. Regex syntax in java.util.regex - since JDK 1.4

References
[*] Perl has its own flavour of regex syntax, which is widely adopted as standard.
