30 Regular expressions library [re]

30.12 Modified ECMAScript regular expression grammar [re.grammar]

The regular expression grammar recognized by basic_regex objects constructed with the ECMAScript flag is that specified by ECMA-262, except as specified below.
Objects whose type is a specialization of basic_regex store within themselves a default-constructed instance of their traits template parameter, henceforth referred to as traits_inst.
This traits_inst object is used to support localization of the regular expression; basic_regex member functions shall not call any locale-dependent C or C++ API, including the formatted string input functions.
Instead, they shall call the appropriate traits member function to achieve the required effect.
The following productions within the ECMAScript grammar are modified as follows:
ClassAtom ::
    -
    ClassAtomNoDash
    ClassAtomExClass
    ClassAtomCollatingElement
    ClassAtomEquivalence

IdentityEscape ::
    SourceCharacter but not c
The following new productions are then added:
ClassAtomExClass ::
    [: ClassName :]

ClassAtomCollatingElement ::
    [. ClassName .]

ClassAtomEquivalence ::
    [= ClassName =]

ClassName ::
    ClassNameCharacter
    ClassNameCharacter ClassName

ClassNameCharacter ::
    SourceCharacter but not one of . or = or :
The productions ClassAtomExClass, ClassAtomCollatingElement and ClassAtomEquivalence provide functionality equivalent to that of the same features in regular expressions in POSIX.
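As an illustration of the added productions, the following sketch places POSIX-style character classes (ClassAtomExClass) inside ordinary ECMAScript bracket expressions. It assumes std::regex with the default std::regex_traits<char> and the "C" locale; the helper name is_identifier is hypothetical and not part of the library.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if s is a letter or underscore followed by any
// number of letters, digits, or underscores. The pattern uses [:alpha:] and
// [:alnum:] as class atoms inside bracket expressions.
inline bool is_identifier(const std::string& s) {
    static const std::regex re("[_[:alpha:]][_[:alnum:]]*");
    return std::regex_match(s, re);
}
```

Under these assumptions, is_identifier("foo_1") is expected to hold, while is_identifier("1foo") is not.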
The regular expression grammar may be modified by any regex_constants::syntax_option_type flags specified when constructing an object whose type is a specialization of basic_regex, according to the rules in Table 136.
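A minimal sketch of how syntax flags can change what a pattern means, assuming std::regex with its default traits. The helpers matches and marked_groups are hypothetical names introduced only for this illustration; icase and nosubs are two of the syntax_option_type flags.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: does s match pattern in full under the given flags?
inline bool matches(const std::string& s, const std::string& pattern,
                    std::regex::flag_type flags) {
    const std::regex re(pattern, flags);
    return std::regex_match(s, re);
}

// Hypothetical helper: number of marked sub-expressions under the given flags.
inline unsigned marked_groups(const std::string& pattern,
                              std::regex::flag_type flags) {
    return std::regex(pattern, flags).mark_count();
}
```

With icase added, "ABC" is expected to match the pattern abc; with nosubs added, the parentheses in (a)(b) no longer count as marked sub-expressions.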
A ClassName production, when used in ClassAtomExClass, is not valid if traits_inst.lookup_classname returns zero for that name.
The names recognized as valid ClassNames are determined by the type of the traits class, but at least the following names shall be recognized: alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit, d, s, w.
In addition, the following expressions shall be equivalent:
  • \d and [[:digit:]]
  • \D and [^[:digit:]]
  • \s and [[:space:]]
  • \S and [^[:space:]]
  • \w and [_[:alnum:]]
  • \W and [^_[:alnum:]]
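The equivalences can be spot-checked by comparing the verdicts of two patterns on the same input. The sketch below assumes std::regex with default traits in the "C" locale; same_verdict is a hypothetical helper, and agreement on a few samples illustrates the rule rather than proving it.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if patterns a and b agree on whether s matches
// in full.
inline bool same_verdict(const std::string& a, const std::string& b,
                         const std::string& s) {
    return std::regex_match(s, std::regex(a)) ==
           std::regex_match(s, std::regex(b));
}
```

For instance, same_verdict("\\w", "[_[:alnum:]]", "_") is expected to be true, since both patterns accept the underscore.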
A ClassName production when used in a ClassAtomCollatingElement production is not valid if the value returned by traits_inst.lookup_collatename for that name is an empty string.
The results from multiple calls to traits_inst.lookup_classname can be bitwise OR'ed together and subsequently passed to traits_inst.isctype.
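A minimal sketch of combining class masks, assuming std::regex_traits<char> as the traits class (a default-constructed object stands in for traits_inst). The helper in_either_class is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if c belongs to at least one of two named
// character classes. The two masks are OR'ed before the isctype call.
inline bool in_either_class(char c, const std::string& name1,
                            const std::string& name2) {
    const std::regex_traits<char> traits;  // stands in for traits_inst
    const auto mask = traits.lookup_classname(name1.begin(), name1.end())
                    | traits.lookup_classname(name2.begin(), name2.end());
    return traits.isctype(c, mask);
}
```

Combining "digit" and "alpha" this way is expected to accept '7' and 'x' but not '-'.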
A ClassName production when used in a ClassAtomEquivalence production is not valid if the value returned by traits_inst.lookup_collatename for that name is an empty string or if the value returned by traits_inst.transform_primary for the result of the call to traits_inst.lookup_collatename is an empty string.
When the sequence of characters being transformed to a finite state machine contains an invalid class name, the translator shall throw an exception object of type regex_error.
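The rejection of an invalid class name can be observed at construction time. The sketch below assumes std::regex with default traits; is_rejected is a hypothetical helper, and the name nosuchclass is assumed not to be recognized by the traits class. Only the exception type is checked, since the exact error code is not stated here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if building a regex from pattern throws
// std::regex_error.
inline bool is_rejected(const std::string& pattern) {
    try {
        const std::regex re(pattern);
        (void)re;  // construction succeeded; the object itself is unused
        return false;
    } catch (const std::regex_error&) {
        return true;
    }
}
```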
If the CV of a UnicodeEscapeSequence is greater than the largest value that can be held in an object of type charT, the translator shall throw an exception object of type regex_error.
[Note 1:
This means that values of the form "uxxxx" that do not fit in a character are invalid.
— end note]
Where the regular expression grammar requires the conversion of a sequence of characters to an integral value, this is accomplished by calling traits_inst.value.
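For a single character, the conversion can be sketched as follows, assuming std::regex_traits<char> as the traits class. The wrapper digit_value is a hypothetical name; value takes a character and a radix of 8, 10, or 16 and returns -1 when the character is not a digit in that radix.

```cpp
#include <cassert>
#include <regex>

// Hypothetical helper: numeric value of ch in the given radix, or -1 if ch
// is not a valid digit in that radix.
inline int digit_value(char ch, int radix) {
    const std::regex_traits<char> traits;  // stands in for traits_inst
    return traits.value(ch, radix);
}
```

A translator could accumulate a multi-digit number, such as a repetition count, by calling this once per character.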
The behavior of the internal finite state machine representation when used to match a sequence of characters is as described in ECMA-262.
The behavior is modified according to any match_flag_type flags ([re.matchflag]) specified when using the regular expression object in one of the regular expression algorithms ([re.alg]).
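As one illustration of a match flag changing the outcome, the sketch below uses match_not_bol, which tells the algorithm not to treat the first character as the beginning of a line. It assumes std::regex with default traits; found is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: does pattern occur anywhere in s under the given
// match flags?
inline bool found(const std::string& s, const std::string& pattern,
                  std::regex_constants::match_flag_type flags) {
    return std::regex_search(s, std::regex(pattern), flags);
}
```

With match_default, the pattern ^a is expected to be found in "abc"; with match_not_bol, it is not.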
The behavior is also localized by interaction with the traits class template parameter as follows:
  • During matching of a regular expression finite state machine against a sequence of characters, two characters c and d are compared using the following rules:
    • if (flags() & regex_constants::icase), the two characters are equal if traits_inst.translate_nocase(c) == traits_inst.translate_nocase(d);
    • otherwise, if flags() & regex_constants::collate, the two characters are equal if traits_inst.translate(c) == traits_inst.translate(d);
    • otherwise, the two characters are equal if c == d.
  • During matching of a regular expression finite state machine against a sequence of characters, comparison of a collating element range c1-c2 against a character c is conducted as follows: if flags() & regex_constants::collate is false, then the character c is matched if c1 <= c && c <= c2; otherwise, c is matched in accordance with the following algorithm:

        string_type str1 = string_type(1,
          flags() & icase ?
            traits_inst.translate_nocase(c1) : traits_inst.translate(c1));
        string_type str2 = string_type(1,
          flags() & icase ?
            traits_inst.translate_nocase(c2) : traits_inst.translate(c2));
        string_type str = string_type(1,
          flags() & icase ?
            traits_inst.translate_nocase(c) : traits_inst.translate(c));
        return traits_inst.transform(str1.begin(), str1.end())
              <= traits_inst.transform(str.begin(), str.end())
          && traits_inst.transform(str.begin(), str.end())
              <= traits_inst.transform(str2.begin(), str2.end());

  • During matching of a regular expression finite state machine against a sequence of characters, testing whether a collating element is a member of a primary equivalence class is conducted by first converting the collating element and the equivalence class to sort keys using traits::transform_primary, and then comparing the sort keys for equality.
  • During matching of a regular expression finite state machine against a sequence of characters, a character c is a member of a character class designated by an iterator range [first, last) if traits_inst.isctype(c, traits_inst.lookup_classname(first, last, flags() & icase)) is true.
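The comparison rules in the list above can be sketched as free functions, assuming std::regex_traits<char> as the traits class and the "C" locale. The names rc, Traits, has_flag, chars_equal, in_range, and in_class are hypothetical and only mirror the rules; this is not how any particular implementation structures its matcher.

```cpp
#include <cassert>
#include <regex>
#include <string>

namespace rc = std::regex_constants;
using Traits = std::regex_traits<char>;

// True if flag is set in flags.
inline bool has_flag(rc::syntax_option_type flags, rc::syntax_option_type flag) {
    return (flags & flag) != rc::syntax_option_type();
}

// Character equality: icase first, then collate, then plain comparison.
inline bool chars_equal(const Traits& t, rc::syntax_option_type flags,
                        char c, char d) {
    if (has_flag(flags, rc::icase))
        return t.translate_nocase(c) == t.translate_nocase(d);
    if (has_flag(flags, rc::collate))
        return t.translate(c) == t.translate(d);
    return c == d;
}

// Range test c1-c2: plain ordering without collate, sort keys with it.
inline bool in_range(const Traits& t, rc::syntax_option_type flags,
                     char c1, char c2, char c) {
    if (!has_flag(flags, rc::collate))
        return c1 <= c && c <= c2;
    // Translate one character, then turn it into a collation sort key.
    auto key = [&](char ch) {
        const std::string s(1, has_flag(flags, rc::icase)
                                   ? t.translate_nocase(ch)
                                   : t.translate(ch));
        return t.transform(s.begin(), s.end());
    };
    return key(c1) <= key(c) && key(c) <= key(c2);
}

// Class membership: the icase flag is forwarded to lookup_classname.
inline bool in_class(const Traits& t, rc::syntax_option_type flags,
                     char c, const std::string& name) {
    return t.isctype(c, t.lookup_classname(name.begin(), name.end(),
                                           has_flag(flags, rc::icase)));
}
```

In this sketch, 'A' and 'a' compare equal only when icase is set, and 'M' falls inside the range a-z only when both collate and icase are set.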
See also: ECMA-262 15.10