Home | Documentation | Foundry | Examples | Demo | Download | Feedback | Books
Unicode blocks and categories
This page describes Unicode blocks and categories, whose names are used in jregex predefined classes.
Categories
Blocks
Viewing supported names
Links

Categories

From the Unicode Standard, Chapter 4.5: "The General Category constitutes a partition of the characters into several major classes, such as letters, punctuation, and symbols, and further subclasses for each of the major classes.
Each Unicode character is assigned a General Category value. Each value of the General Category is defined as a two-letter abbreviation, where the first letter gives information about a major class and the second letter designates a subclass of that major class. In each class the subclass 'other' merely collects the remaining characters of the major class."
- see Chapter 4 of the Unicode Standard, v.3.0

Currently supported categories are:
 Normative
     Mn = Mark, Non-Spacing
     Mc = Mark, Spacing Combining
     Me = Mark, Enclosing

     Nd = Number, Decimal Digit
     Nl = Number, Letter
     No = Number, Other

     Zs = Separator, Space
     Zl = Separator, Line
     Zp = Separator, Paragraph

     Cc = Other, Control
     Cf = Other, Format
     Cs = Other, Surrogate
     Co = Other, Private Use
     Cn = Other, Not Assigned

 Informative
     Lu = Letter, Uppercase
     Ll = Letter, Lowercase
     Lt = Letter, Titlecase
     Lm = Letter, Modifier
     Lo = Letter, Other

     Pc = Punctuation, Connector
     Pd = Punctuation, Dash
     Ps = Punctuation, Open
     Pe = Punctuation, Close
    *Pi = Punctuation, Initial quote
    *Pf = Punctuation, Final quote
     Po = Punctuation, Other

     Sm = Symbol, Math
     Sc = Symbol, Currency
     Sk = Symbol, Modifier
     So = Symbol, Other
 
 *Unsupported by Java (and hence unsupported by jregex). 
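
As a rough illustration of the two-letter codes listed above, the sketch below uses the standard java.lang.Character and java.util.regex classes rather than jregex itself. It assumes (this page does not confirm it) that jregex accepts the same \p{Lu}-style syntax for category names. The class name CategoryDemo and the helpers categoryCode and allInCategory are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class CategoryDemo {
    // Translate the constant returned by Character.getType into the
    // two-letter abbreviation used in the list above (only a few are mapped).
    static String categoryCode(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:     return "Lu";
            case Character.LOWERCASE_LETTER:     return "Ll";
            case Character.TITLECASE_LETTER:     return "Lt";
            case Character.DECIMAL_DIGIT_NUMBER: return "Nd";
            case Character.SPACE_SEPARATOR:      return "Zs";
            case Character.MATH_SYMBOL:          return "Sm";
            case Character.CURRENCY_SYMBOL:      return "Sc";
            default:                             return "??"; // not mapped in this sketch
        }
    }

    // True if every character of text belongs to the given category.
    // Uses java.util.regex as a stand-in for jregex (assumption).
    static boolean allInCategory(String code, String text) {
        return Pattern.matches("\\p{" + code + "}+", text);
    }

    public static void main(String[] args) {
        System.out.println(categoryCode('A'));           // Lu
        System.out.println(categoryCode('5'));           // Nd
        System.out.println(categoryCode('$'));           // Sc
        System.out.println(allInCategory("Lu", "ABC"));  // true
        System.out.println(allInCategory("Lu", "AbC"));  // false
    }
}
```

The first letter of each code names the major class (L, N, Z, S, ...) and the second the subclass, so a pattern such as \p{Nd} selects only decimal digits rather than all numbers.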
                 

Blocks

From the Unicode glossary: "Block. A grouping of related characters within the Unicode encoding space. A block may contain unassigned positions, which are reserved."

A list of Unicode blocks along with their boundaries is available here (local copy) and here (at www.unicode.org).

How to use these names in patterns
  • First, remove all spaces;
  • Second, prepend each name with "In", so:

    'Basic Latin' is used as \p{InBasicLatin},
    'Latin-1 Supplement' is used as \p{InLatin-1Supplement},
    'Cyrillic' is used as \p{InCyrillic},
    'Armenian' is used as \p{InArmenian},
    etc.
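
The two steps above can be sketched as a small helper. This is only an illustration: BlockNameDemo, toClassName and inBlock are hypothetical names, and the matching is done with java.util.regex as a stand-in, on the assumption that it treats these \p{In...} names the same way jregex does for the blocks shown.

```java
import java.util.regex.Pattern;

class BlockNameDemo {
    // Step 1: remove all spaces; step 2: prepend "In" and wrap in \p{...}.
    static String toClassName(String blockName) {
        return "\\p{In" + blockName.replace(" ", "") + "}";
    }

    // True if every character of text lies in the named block.
    // java.util.regex is used here instead of jregex (assumption).
    static boolean inBlock(String blockName, String text) {
        return Pattern.matches(toClassName(blockName) + "+", text);
    }

    public static void main(String[] args) {
        System.out.println(toClassName("Basic Latin"));        // \p{InBasicLatin}
        System.out.println(toClassName("Latin-1 Supplement")); // \p{InLatin-1Supplement}

        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442"; // Cyrillic letters
        System.out.println(inBlock("Cyrillic", cyrillic));     // true
        System.out.println(inBlock("Basic Latin", cyrillic));  // false
        System.out.println(inBlock("Basic Latin", "hello"));   // true
    }
}
```

Note that only spaces are removed: other characters in the block name, such as the hyphen in 'Latin-1 Supplement', are kept.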

Viewing supported names

  • To view all supported names, run jregex.CharacterClass as a Java application with no arguments.
  • To check whether particular names are supported, run jregex.CharacterClass as a Java application, supplying those names as arguments.

Links

    Unicode charts
    Unicode character database
  • UnicodeData.txt

    Copyright 2000-2002 S. A. Samokhodkin