org.apache.xmlbeans.impl.regex.RegularExpression

All Implemented Interfaces: Serializable

Direct Known Subclasses: SchemaRegularExpression

public class RegularExpression
extends Object
implements Serializable

A regular expression matching engine based on a non-deterministic finite automaton (NFA). This engine does not conform to POSIX regular expressions.

How to use

A. Standard way

 RegularExpression re = new RegularExpression(regex);
 if (re.matches(text)) { ... }

B. Capturing groups

 RegularExpression re = new RegularExpression(regex);
 Match match = new Match();
 if (re.matches(text, match)) {
     ... // You can refer to the captured text with methods of the Match class.
 }

Case-insensitive matching

 RegularExpression re = new RegularExpression(regex, "i");
 if (re.matches(text)) { ... }

Options

You can pass options to RegularExpression(regex, options) or setPattern(regex, options). The options parameter is a string made up of the following characters.

"i": Case-insensitive matching.
"m": ^ and $ take the EOL characters within the text into account.
"s": . matches any one character, including EOL characters.
"u": Redefines \d \D \w \W \s \S \b \B \< \> to follow their Unicode definitions.
"w": With this option, \b \B \< \> are processed by the method of 'Unicode Regular Expression Guidelines' Revision 4. When "w" and "u" are both specified, \b \B \< \> are processed according to the "w" option.
",": The parser treats a comma in a character class as a range separator. Without this option, [a,b] matches a, a comma, or b. With this option, [a,b] matches a or b.
"X": With this option, the engine conforms to XML Schema: Regular Expression. The matches() method then matches the entire string rather than a substring.
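
As a rough illustration of the option string described above, the sketch below checks that a string uses only the documented option characters and compares two option strings without regard to letter order. OptionStrings and its methods are hypothetical helpers, not part of RegularExpression, and how the real parser treats unknown or repeated letters is not covered here.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper; not part of org.apache.xmlbeans.impl.regex. */
final class OptionStrings {
    // The option characters listed in the class documentation.
    private static final String KNOWN = "imsuw,X";

    /** Returns true if every character of options is a documented option character. */
    static boolean isKnown(String options) {
        for (int i = 0; i < options.length(); i++) {
            if (KNOWN.indexOf(options.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    /** Compares two option strings as sets of characters, ignoring order and repeats. */
    static boolean sameOptions(String a, String b) {
        return toSet(a).equals(toSet(b));
    }

    private static Set<Character> toSet(String options) {
        Set<Character> set = new TreeSet<>();
        for (char c : options.toCharArray()) {
            set.add(c);
        }
        return set;
    }
}
```

A comparison of this kind may be useful when checking the value returned by getOptions(), since the order of letters in it is documented as possibly differing from the string that was passed in.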

Syntax

Differences from the Perl 5 regular expression

There is a 6-digit hexadecimal character representation (\vHHHHHH).
Supports subtraction, union, and intersection operations for character classes.
Not supported: \ooo (octal character representations), \G, \C, \lc, \uc, \L, \U, \E, \Q, \N{name}, (?{code}), (??{code})

Meta characters are `. * + ? { [ ( ) | \ ^ $'.

Character

. (A period)
Matches any one character except the following characters.
LINE FEED (U+000A), CARRIAGE RETURN (U+000D), PARAGRAPH SEPARATOR (U+2029), LINE SEPARATOR (U+2028)
This expression matches one code point in Unicode. It can match a pair of surrogates.
When the "s" option is specified, it matches any character including the above four characters.
\e \f \n \r \t
Matches ESCAPE (U+001B), FORM FEED (U+000C), LINE FEED (U+000A), CARRIAGE RETURN (U+000D), HORIZONTAL TABULATION (U+0009)
\cC
Matches a control character. C must be one of '@', 'A'-'Z', '[', '\', ']', '^', '_'. It matches the control character whose character code is 0x0040 less than the character code of C.
For example, \cJ matches LINE FEED (U+000A), and \c[ matches ESCAPE (U+001B).
a non-meta character
Matches the character.
\ + a meta character
Matches the meta character.
\xHH \x{HHHH}
Matches the character whose Unicode code point is HH (hexadecimal). \xHH takes exactly 2 digits; \x{HHHH} takes a variable number of digits.
\vHHHHHH
Matches the character whose Unicode code point is HHHHHH (hexadecimal).
\g
Matches a grapheme.
It is equivalent to (?[\p{ASSIGNED}]-[\p{M}\p{C}])?(?:\p{M}|[\x{094D}\x{09CD}\x{0A4D}\x{0ACD}\x{0B3D}\x{0BCD}\x{0C4D}\x{0CCD}\x{0D4D}\x{0E3A}\x{0F84}]\p{L}|[\x{1160}-\x{11A7}]|[\x{11A8}-\x{11FF}]|[\x{FF9E}\x{FF9F}])*
\X
Matches a combining character sequence. It is equivalent to (?:\PM\pM*)
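
The \cC rule given earlier in this section is plain arithmetic on character codes, which the following sketch spells out. ControlEscape is a hypothetical helper written for illustration only; it does not call the regex engine.

```java
/** Hypothetical helper illustrating the \cC rule; not part of the regex package. */
final class ControlEscape {
    /**
     * Returns the control character denoted by \cC, that is, the character
     * whose code is 0x40 less than the code of c. c must be in '@'..'_'.
     */
    static char controlCharFor(char c) {
        if (c < '@' || c > '_') {
            throw new IllegalArgumentException("not a valid \\cC letter: " + c);
        }
        return (char) (c - 0x40);
    }

    public static void main(String[] args) {
        // Prints the code points for the two examples from the text.
        System.out.printf("\\cJ -> U+%04X%n", (int) controlCharFor('J'));
        System.out.printf("\\c[ -> U+%04X%n", (int) controlCharFor('['));
    }
}
```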
Character class
[R₁R₂...R_n] (without "," option)
[R₁,R₂,...,R_n] (with "," option)
Positive character class. It matches a character in ranges.
R_n:
- A character (including \e \f \n \r \t \xHH \x{HHHH} \vHHHHHH)
  This range matches the character.
- C₁-C₂
  This range matches a character whose code point is >= C₁'s code point and <= C₂'s code point.
- A POSIX character class: [:alpha:] [:alnum:] [:ascii:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:], and negative POSIX character classes in Perl like [:^alpha:]
  ...
- \d \D \s \S \w \W \p{name} \P{name}
  These expressions specify the same ranges as the following expressions.
Enumerated ranges are merged (union operation). [a-ec-z] is equivalent to [a-z].
[^R₁R₂...R_n] (without a "," option)
[^R₁,R₂,...,R_n] (with a "," option)
Negative character class. It matches a character not in ranges.
(?[ranges]op[ranges]op[ranges] ... ) (op is - or + or &.)
Subtraction or union or intersection for character classes.
For example, (?[A-Z]-[CF]) is equivalent to [A-BD-EG-Z], and (?[\x00-\x7f]-[K]&[\p{Lu}]) is equivalent to [A-JL-Z].
The result of these operations is a positive character class even if an expression includes negative character classes. Take care with this in case-insensitive matching. For instance, (?[^b]) is equivalent to [\x00-ac-\x{10ffff}], which is equivalent to [^b] in case-sensitive matching. In case-insensitive matching, however, (?[^b]) matches any character, because the class includes 'B' and 'B' matches 'b', whereas [^b] is processed as [^Bb].
[R₁R₂...-[R_nR_n+1...]] (with an "X" option)

Character class subtraction for the XML Schema. You can use this syntax when you specify an "X" option.
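
The subtraction and intersection examples given above for (?[ranges]op[ranges]) can be modelled with java.util.BitSet, treating each character class as a set of code points. This is only a sketch of the set arithmetic, restricted to code points 0 to 0x7F; ClassOps is a hypothetical class, and applying the operators left to right is an assumption inferred from the documented result [A-JL-Z].

```java
import java.util.BitSet;

/** Hypothetical model of character-class set operations over code points 0..0x7F. */
final class ClassOps {
    /** Builds the set of code points in [from, to], inclusive. */
    static BitSet range(int from, int to) {
        BitSet set = new BitSet();
        set.set(from, to + 1);
        return set;
    }

    /** Builds the set containing exactly the characters of chars. */
    static BitSet of(String chars) {
        BitSet set = new BitSet();
        for (int i = 0; i < chars.length(); i++) {
            set.set(chars.charAt(i));
        }
        return set;
    }

    /** Models (?[A-Z]-[CF]): subtraction removes the right-hand members. */
    static BitSet upperMinusCF() {
        BitSet result = range('A', 'Z');
        result.andNot(of("CF"));
        return result;
    }

    /** Models (?[\x00-\x7f]-[K]&[\p{Lu}]), applying the operators left to right. */
    static BitSet asciiMinusKAndUpper() {
        BitSet result = range(0x00, 0x7F);
        result.andNot(of("K"));                 // subtraction
        BitSet upper = new BitSet();
        for (int cp = 0; cp <= 0x7F; cp++) {
            if (Character.getType(cp) == Character.UPPERCASE_LETTER) {
                upper.set(cp);
            }
        }
        result.and(upper);                      // intersection
        return result;
    }
}
```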
\d
Equivalent to [0-9].
When the "u" option is set, it is equivalent to \p{Nd}.
\D
Equivalent to [^0-9].
When the "u" option is set, it is equivalent to \P{Nd}.
\s
Equivalent to [ \f\n\r\t].
When the "u" option is set, it is equivalent to [ \f\n\r\t\p{Z}].
\S
Equivalent to [^ \f\n\r\t].
When the "u" option is set, it is equivalent to [^ \f\n\r\t\p{Z}].
\w
Equivalent to [a-zA-Z0-9_].
When the "u" option is set, it is equivalent to [\p{Lu}\p{Ll}\p{Lo}\p{Nd}_].
\W
Equivalent to [^a-zA-Z0-9_].
When the "u" option is set, it is equivalent to [^\p{Lu}\p{Ll}\p{Lo}\p{Nd}_].
\p{name}
Matches one character in the specified General Category (the second field in UnicodeData.txt) or the specified Block. The following names are available:

Unicode General Categories:
L, M, N, Z, C, P, S, Lu, Ll, Lt, Lm, Lo, Mn, Me, Mc, Nd, Nl, No, Zs, Zl, Zp, Cc, Cf, Cn, Co, Cs, Pd, Ps, Pe, Pc, Po, Sm, Sc, Sk, So
(Currently the Cn category includes U+10000-U+10FFFF characters)
Unicode Blocks:
Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, IPA Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Tibetan, Georgian, Hangul Jamo, Latin Extended Additional, Greek Extended, General Punctuation, Superscripts and Subscripts, Currency Symbols, Combining Marks for Symbols, Letterlike Symbols, Number Forms, Arrows, Mathematical Operators, Miscellaneous Technical, Control Pictures, Optical Character Recognition, Enclosed Alphanumerics, Box Drawing, Block Elements, Geometric Shapes, Miscellaneous Symbols, Dingbats, CJK Symbols and Punctuation, Hiragana, Katakana, Bopomofo, Hangul Compatibility Jamo, Kanbun, Enclosed CJK Letters and Months, CJK Compatibility, CJK Unified Ideographs, Hangul Syllables, High Surrogates, High Private Use Surrogates, Low Surrogates, Private Use, CJK Compatibility Ideographs, Alphabetic Presentation Forms, Arabic Presentation Forms-A, Combining Half Marks, CJK Compatibility Forms, Small Form Variants, Arabic Presentation Forms-B, Specials, Halfwidth and Fullwidth Forms
Others:
ALL (Equivalent to [\u0000-\v10FFFF])
ASSIGNED (\p{ASSIGNED} is equivalent to \P{Cn})
UNASSIGNED (\p{UNASSIGNED} is equivalent to \p{Cn})

\P{name}
Matches one character not in the specified General Category or the specified Block.
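
To make the difference between the two documented definitions of \d concrete, the following sketch tests code points against [0-9] and against General Category Nd using java.lang.Character. DigitClasses is a hypothetical helper; it relies on the Unicode tables of the running JDK, which are assumed (not verified) to agree with the engine's own tables for the code points shown.

```java
/** Hypothetical helper contrasting the two documented definitions of \d. */
final class DigitClasses {
    /** \d without the "u" option: equivalent to [0-9]. */
    static boolean isAsciiDigit(int codePoint) {
        return codePoint >= '0' && codePoint <= '9';
    }

    /** \d with the "u" option: equivalent to \p{Nd} (General Category Nd). */
    static boolean isDecimalNumber(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        int arabicIndicOne = 0x0661; // ARABIC-INDIC DIGIT ONE, category Nd
        System.out.println(isAsciiDigit(arabicIndicOne));    // outside [0-9]
        System.out.println(isDecimalNumber(arabicIndicOne)); // inside \p{Nd}
    }
}
```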
Selection and Quantifier

X|Y
...
X*
Matches 0 or more X.
X+
Matches 1 or more X.
X?
Matches 0 or 1 X.
X{number}
Matches number times.
X{min,}
...
X{min,max}
...
X*?
X+?
X??
X{min,}?
X{min,max}?
Non-greedy matching.
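
The difference between greedy and non-greedy quantifiers can be seen in a small experiment. The sketch below uses java.util.regex as a stand-in, because it shares these quantifier forms; it does not call RegularExpression, and identical behaviour of the two engines is an assumption. GreedyDemo and firstMatch are names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Uses java.util.regex as a stand-in; RegularExpression itself is not called. */
final class GreedyDemo {
    /** Returns the first substring of text matched by regex, or null if none. */
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("<.+>", "<a><b>"));   // greedy: <a><b>
        System.out.println(firstMatch("<.+?>", "<a><b>"));  // non-greedy: <a>
        System.out.println(firstMatch("a{2,3}", "aaaa"));   // greedy: aaa
        System.out.println(firstMatch("a{2,3}?", "aaaa"));  // non-greedy: aa
    }
}
```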
Grouping, Capturing, and Back-reference
(?:X)
Grouping. "foo+" matches "foo" or "foooo". If you want it to match "foofoo" or "foofoofoo", you have to write "(?:foo)+".
(X)
Grouping with capturing. It makes a group, and applications can find out where in the target text the group matched by calling methods of a Match instance after matches(String,Match). The 0th group is the whole of this regular expression. The Nth group is the inside of the Nth left parenthesis.
For instance, if the regular expression is " *([^<:]*) +<([^>]*)> *" and the target text is "From: TAMURA Kent <kent@trl.ibm.co.jp>":
- Match.getCapturedText(0): " TAMURA Kent <kent@trl.ibm.co.jp>"
- Match.getCapturedText(1): "TAMURA Kent"
- Match.getCapturedText(2): "kent@trl.ibm.co.jp"
\1 \2 \3 \4 \5 \6 \7 \8 \9

(?>X)
Independent expression group. ................
(?options:X)
(?options-options2:X)
............................
options and options2 each consist of 'i' 'm' 's' 'w'. Note that they cannot contain 'u'.
(?options)
(?options-options2)
......
These expressions must be at the beginning of a group.
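
The capturing example given above for (X), and the "foo+" versus "(?:foo)+" contrast, can be reproduced with java.util.regex, which shares this grouping syntax. This is a stand-in sketch only: it does not use RegularExpression or Match, and CaptureDemo is a hypothetical class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Uses java.util.regex as a stand-in; RegularExpression and Match are not called. */
final class CaptureDemo {
    /** Returns groups 0..N of the first match in text, or null if nothing matches. */
    static String[] capture(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return null;
        }
        // Index 0 is the whole match; index N is the Nth left parenthesis.
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] g = capture(" *([^<:]*) +<([^>]*)> *",
                             "From: TAMURA Kent <kent@trl.ibm.co.jp>");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": \"" + g[i] + "\"");
        }
    }
}
```

The returned array counts group 0, so its length is 1 for a pattern without capturing parentheses; this mirrors the documented behaviour of getNumberOfGroups(), whereas Matcher.groupCount() itself excludes group 0.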
Anchor

\A
Matches the beginning of the text.
\Z
Matches the end of the text, or before an EOL character at the end of the text, or CARRIAGE RETURN + LINE FEED at the end of the text.
\z
Matches the end of the text.
^
Matches the beginning of the text. It is equivalent to \A.
When the "m" option is set, it matches the beginning of the text, or after one of the EOL characters: LINE FEED (U+000A), CARRIAGE RETURN (U+000D), LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029).
$
Matches the end of the text, or before an EOL character at the end of the text, or CARRIAGE RETURN + LINE FEED at the end of the text.
When the "m" option is set, it matches the end of the text, or before an EOL character.
\b
Matches a word boundary. (See the "w" option.)
\B
Matches a non-word boundary. (See the "w" option.)
\<
Matches the beginning of a word. (See the "w" option.)
\>
Matches the end of a word. (See the "w" option.)
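
The anchor rules above are easiest to compare side by side. The following sketch uses java.util.regex as a stand-in, with Pattern.MULTILINE playing the role of the "m" option; it does not call RegularExpression. Note that java.util.regex also treats U+0085 as a line terminator, so the two engines are assumed to agree only for the LINE FEED case shown. AnchorDemo is a hypothetical class.

```java
import java.util.regex.Pattern;

/** Uses java.util.regex as a stand-in; RegularExpression itself is not called. */
final class AnchorDemo {
    /** Returns true if regex, compiled with flags, is found somewhere in text. */
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        // \Z accepts a trailing EOL character; \z requires the very end.
        System.out.println(found("abc\\Z", 0, "abc\n"));
        System.out.println(found("abc\\z", 0, "abc\n"));
        // ^ and $ see interior line breaks only in multiline mode.
        System.out.println(found("^b", 0, "a\nb"));
        System.out.println(found("^b", Pattern.MULTILINE, "a\nb"));
    }
}
```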
Lookahead and lookbehind

(?=X)
Lookahead.
(?!X)
Negative lookahead.
(?<=X)
Lookbehind.
(Note for text capturing......)
(?<!X)
Negative lookbehind.
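
The four lookaround forms listed above assert something about the surrounding text without including it in the match. The sketch below shows one case of each using java.util.regex as a stand-in, since it accepts the same four forms; it does not call RegularExpression, and LookaroundDemo is a hypothetical class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Uses java.util.regex as a stand-in; RegularExpression itself is not called. */
final class LookaroundDemo {
    /** Returns the first substring of text matched by regex, or null if none. */
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Lookahead: digits that are followed by " USD".
        System.out.println(firstMatch("\\d+(?= USD)", "10 EUR, 25 USD"));
        // Negative lookahead: digits not followed by a digit or '%'.
        System.out.println(firstMatch("\\d+(?!\\d|%)", "50% of 80"));
        // Lookbehind: digits that are preceded by '$'.
        System.out.println(firstMatch("(?<=\\$)\\d+", "id 7, price $42"));
        // Negative lookbehind: digits not preceded by '$'.
        System.out.println(firstMatch("(?<!\\$)\\d+", "$5 and 7"));
    }
}
```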
Misc.

(?(condition)yes-pattern|no-pattern),
(?(condition)yes-pattern)
......
(?#comment)
Comment. A comment string consists of characters except ')'. You cannot write comments in character classes or before quantifiers.

BNF for the regular expression

 regex ::= ('(?' options ')')? term ('|' term)*
 term ::= factor+
 factor ::= anchors | atom (('*' | '+' | '?' | minmax ) '?'? )?
            | '(?#' [^)]* ')'
 minmax ::= '{' ([0-9]+ | [0-9]+ ',' | ',' [0-9]+ | [0-9]+ ',' [0-9]+) '}'
 atom ::= char | '.' | char-class | '(' regex ')' | '(?:' regex ')' | '\' [0-9]
          | '\w' | '\W' | '\d' | '\D' | '\s' | '\S' | category-block | '\X'
          | '(?>' regex ')' | '(?' options ':' regex ')'
          | '(?' ('(' [0-9] ')' | '(' anchors ')' | looks) term ('|' term)? ')'
 options ::= [imsw]* ('-' [imsw]+)?
 anchors ::= '^' | '$' | '\A' | '\Z' | '\z' | '\b' | '\B' | '\<' | '\>'
 looks ::= '(?=' regex ')'  | '(?!' regex ')'
           | '(?<=' regex ')' | '(?<!' regex ')'
 char ::= '\\' | '\' [efnrtv] | '\c' [@-_] | code-point | character-1
 category-block ::= '\' [pP] category-symbol-1
                    | ('\p{' | '\P{') (category-symbol | block-name
                                       | other-properties) '}'
 category-symbol-1 ::= 'L' | 'M' | 'N' | 'Z' | 'C' | 'P' | 'S'
 category-symbol ::= category-symbol-1 | 'Lu' | 'Ll' | 'Lt' | 'Lm' | 'Lo'
                     | 'Mn' | 'Me' | 'Mc' | 'Nd' | 'Nl' | 'No'
                     | 'Zs' | 'Zl' | 'Zp' | 'Cc' | 'Cf' | 'Cn' | 'Co' | 'Cs'
                     | 'Pd' | 'Ps' | 'Pe' | 'Pc' | 'Po'
                     | 'Sm' | 'Sc' | 'Sk' | 'So'
 block-name ::= (See above)
 other-properties ::= 'ALL' | 'ASSIGNED' | 'UNASSIGNED'
 character-1 ::= (any character except meta-characters)

 char-class ::= '[' ranges ']'
                | '(?[' ranges ']' ([-+&] '[' ranges ']')? ')'
 ranges ::= '^'? (range ','?)+
 range ::= '\d' | '\w' | '\s' | '\D' | '\W' | '\S' | category-block
           | range-char | range-char '-' range-char
 range-char ::= '\[' | '\]' | '\\' | '\' [,-efnrtv] | code-point | character-2
 code-point ::= '\x' hex-char hex-char
                | '\x{' hex-char+ '}'
                | '\v' hex-char hex-char hex-char hex-char hex-char hex-char
 hex-char ::= [0-9a-fA-F]
 character-2 ::= (any character except \[]-,)
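
As one way to read the grammar above, the minmax production can be transcribed into a small recognizer. The sketch below expresses it with java.util.regex purely as a checking tool; MinMaxSyntax is a hypothetical class and is not how the engine's own parser is implemented. The production admits a form with only a maximum, such as {,5}, which the Syntax section does not list; whether the parser actually accepts it is not verified here.

```java
import java.util.regex.Pattern;

/** Hypothetical recognizer for the minmax production of the BNF. */
final class MinMaxSyntax {
    // minmax ::= '{' ([0-9]+ | [0-9]+ ',' | ',' [0-9]+ | [0-9]+ ',' [0-9]+) '}'
    private static final Pattern MINMAX =
            Pattern.compile("\\{(?:[0-9]+|[0-9]+,|,[0-9]+|[0-9]+,[0-9]+)\\}");

    /** Returns true if s, taken as a whole, is derivable from minmax. */
    static boolean isMinMax(String s) {
        return MINMAX.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"{3}", "{3,}", "{,5}", "{3,5}", "{}", "{,}"}) {
            System.out.println(s + " -> " + isMinMax(s));
        }
    }
}
```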

TODO

Unicode Regular Expression Guidelines
- 2.4 Canonical Equivalents
- Level 3
Parsing performance

Version: $Id: RegularExpression.java 111285 2004-12-08 16:54:26Z cezar $
Author: TAMURA Kent <kent@trl.ibm.co.jp>
See Also: Serialized Form

Constructor Summary

Constructors
Constructor	Description
`RegularExpression(String regex)`	Creates a new RegularExpression instance.
`RegularExpression(String regex, String options)`	Creates a new RegularExpression instance with options.

Method Summary

Modifier and Type	Method	Description
`boolean`	`equals(Object obj)`	Returns true if the patterns are the same and the options are equivalent.
`int`	`getNumberOfGroups()`	Returns the number of regular expression groups.
`String`	`getOptions()`	Returns an option string.
`String`	`getPattern()`
`int`	`hashCode()`
`boolean`	`matches(char[] target)`	Checks whether the `target` text contains this pattern or not.
`boolean`	`matches(char[] target, int start, int end)`	Checks whether the `target` text contains this pattern in specified range or not.
`boolean`	`matches(char[] target, int start, int end, Match match)`	Checks whether the `target` text contains this pattern in specified range or not.
`boolean`	`matches(char[] target, Match match)`	Checks whether the `target` text contains this pattern or not.
`boolean`	`matches(String target)`	Checks whether the `target` text contains this pattern or not.
`boolean`	`matches(String target, int start, int end)`	Checks whether the `target` text contains this pattern in specified range or not.
`boolean`	`matches(String target, int start, int end, Match match)`	Checks whether the `target` text contains this pattern in specified range or not.
`boolean`	`matches(String target, Match match)`	Checks whether the `target` text contains this pattern or not.
`boolean`	`matches(CharacterIterator target)`	Checks whether the `target` text contains this pattern or not.
`boolean`	`matches(CharacterIterator target, Match match)`	Checks whether the `target` text contains this pattern or not.
`void`	`setPattern(String newPattern)`
`void`	`setPattern(String newPattern, String options)`
`String`	`toString()`	Represents this instance as a String.

Methods inherited from class java.lang.Object

clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Constructor Details
- RegularExpression
  
  public RegularExpression(String regex) throws ParseException
  
  Creates a new RegularExpression instance.
  
  Parameters:
  
  regex - A regular expression
  
  Throws:
  
  ParseException - if regex does not conform to the syntax.
- RegularExpression
  
  public RegularExpression(String regex, String options) throws ParseException
  
  Creates a new RegularExpression instance with options.
  
  Parameters:
  
  regex - A regular expression
  
  options - A String consisting of "i" "m" "s" "u" "w" "," "X"
  
  Throws:
  
  ParseException - if regex does not conform to the syntax.
Method Details
- matches
  
  public boolean matches(char[] target)
  
  Checks whether the target text contains this pattern or not.
  
  Returns:
  
  true if the target is matched to this regular expression.
- matches
  
  public boolean matches(char[] target, int start, int end)
  
  Checks whether the target text contains this pattern in specified range or not.
  
  Parameters:
  
  start - Start offset of the range.
  
  end - End offset +1 of the range.
  
  Returns:
  
  true if the target is matched to this regular expression.
- matches
  
  public boolean matches(char[] target, Match match)
  
  Checks whether the target text contains this pattern or not.
  
  Parameters:
  
  match - A Match instance for storing matching result.
  
  Returns:
  
  true if the target is matched to this regular expression.
- matches
  
  public boolean matches(char[] target, int start, int end, Match match)
  
  Checks whether the target text contains this pattern in specified range or not.
  
  Parameters:
  
  start - Start offset of the range.
  
  end - End offset +1 of the range.
  
  match - A Match instance for storing matching result.
  
  Returns:
  
  true if the target is matched to this regular expression.
- matches
  
  public boolean matches(String target)
  
  Checks whether the target text contains this pattern or not.
  
  Returns:
  
  true if the target is matched to this regular expression.
- matches
  
  public boolean matches(String target, int start, int end)
  
  Checks whether the target text contains this pattern in specified range or not.
  
  Parameters:
  
  start - Start offset of the range.
  
  end - End offset +1 of the range.
  
  Returns:
  
  true if the target is matched to this regular expression.
- matches
  
  public boolean matches(String target, Match match)
  
  Checks whether the target text contains this pattern or not.
  
  Parameters:
  
  match - A Match instance for storing matching result.
  
  Returns:
  
  true if the target is matched to this regular expression.
- matches
  
  public boolean matches(String target, int start, int end, Match match)
  
  Checks whether the target text contains this pattern in specified range or not.
  
  Parameters:
  
  start - Start offset of the range.
  
  end - End offset +1 of the range.
  
  match - A Match instance for storing matching result.
  
  Returns:
  
  true if the target is matched to this regular expression.
- matches
  
  public boolean matches(CharacterIterator target)
  
  Checks whether the target text contains this pattern or not.
  
  Returns:
  
  true if the target is matched to this regular expression.
- matches
  
  public boolean matches(CharacterIterator target, Match match)
  
  Checks whether the target text contains this pattern or not.
  
  Parameters:
  
  match - A Match instance for storing matching result.
  
  Returns:
  
  true if the target is matched to this regular expression.
- setPattern
  
  public void setPattern(String newPattern) throws ParseException
  
  Throws:
  
  ParseException
- setPattern
  
  public void setPattern(String newPattern, String options) throws ParseException
  
  Throws:
  
  ParseException
- getPattern
  
  public String getPattern()
- toString
  
  public String toString()
  
  Represents this instance as a String.
  
  Overrides:
  
  toString in class Object
- getOptions
  
  public String getOptions()
  
  Returns an option string. The order of letters in it may differ from the string specified in a constructor or setPattern().
  
  See Also:
  
  RegularExpression(java.lang.String,java.lang.String), setPattern(java.lang.String,java.lang.String)
- equals
  
  public boolean equals(Object obj)
  
  Returns true if the patterns are the same and the options are equivalent.
  
  Overrides:
  
  equals in class Object
- hashCode
  
  public int hashCode()
  
  Overrides:
  
  hashCode in class Object
- getNumberOfGroups
  
  public int getNumberOfGroups()
  
  Returns the number of regular expression groups. This method returns 1 when the regular expression has no capturing parentheses.
