2986. Handling of multi-character collating elements by the regex FSM is underspecified

Section: 32.12 [re.grammar] Status: New Submitter: Hubert Tong Opened: 2017-06-25 Last modified: 2017-07-12 01:35:02 UTC

Priority: 4

View other active issues in [re.grammar].

View all other issues in [re.grammar].

View all issues with New status.


In N4660 subclause 31.13 [re.grammar] paragraph 5:

The productions ClassAtomExClass, ClassAtomCollatingElement and ClassAtomEquivalence provide functionality equivalent to that of the same features in regular expressions in POSIX.

The broadness of the above statement makes it sound like it is merely a statement of intent; however, this appears to be a necessary normative statement insofar as identifying the general semantics to be associated with the syntactic forms identified. In any case, if it is meant for ClassAtomCollatingElement to provide functionality equivalent to a collating symbol in a POSIX bracket expression, multi-character collating elements need to be considered.

In [re.grammar] paragraph 14:

The behavior of the internal finite state machine representation when used to match a sequence of characters is as described in ECMA-262. The behavior is modified according to any match_flag_type flags specified when using the regular expression object in one of the regular expression algorithms. The behavior is also localized by interaction with the traits class template parameter as follows: [bullets 14.1 to 14.4]

In none of the bullets does the wording handle multi-character collating elements in a clear manner:

The ECMA-262 specification for ClassRanges also deals in characters.

[2017-07 Toronto Monday issue prioritization]

Priority 4

Proposed resolution: