2018. [CD] regex_traits::isctype Returns clause is wrong

Section: 28.6.6 [re.traits] Status: C++14 Submitter: Jonathan Wakely Opened: 2010-11-16 Last modified: 2016-01-28

Priority: Not Prioritized

View all other issues in [re.traits].

View all issues with C++14 status.

Discussion:

Addresses GB 10

28.6.6 [re.traits] p. 12 says:

returns true if f bitwise or'ed with the result of calling lookup_classname with an iterator pair that designates the character sequence "w" is not equal to 0 and c == '_'

If the bitmask value corresponding to "w" has a non-zero value (which it must do) then the bitwise or with any value is also non-zero, and so isctype('_', f) returns true for any f. Obviously this is wrong, since '_' is not in every ctype category.

There's a similar problem with the following phrases discussing the "blank" char class.

[2011-05-06: Jonathan Wakely comments and provides suggested wording]

DR 2019 added isblank support to <locale> which simplifies the definition of regex_traits::isctype by removing the special case for the "blank" class.

My suggestion for 2018 is to add a new table replacing the lists of recognized names in the Remarks clause of regex_traits::lookup_classname. I then refer to that table in the Returns clause of regex_traits::isctype to expand on the "in an unspecified manner" wording which is too vague. The conversion can now be described using the "is set" term defined by 16.3.3.3.3 [bitmask.types] and the new table to convey the intented relationship between e.g. [[:digit:]] and ctype_base::digit, which is not actually stated in the FDIS.

The effects of isctype can then most easily be described in code, given an "exposition only" function prototype to do the not-quite-so-unspecified conversion from char_class_type to ctype_base::mask.

The core of LWG 2018 is the "bitwise or'ed" wording which gives the wrong result, always evaluating to true for all values of f. That is replaced by the condition (f&x) == x where x is the result of calling lookup_classname with "w". I believe that's necessary, because the "w" class could be implemented by an internal "underscore" class i.e. x = _Alnum|_Underscore in which case (f&x) != 0 would give the wrong result when f==_Alnum.

The proposed resolution also makes use of ctype::widen which addresses the problem that the current wording only talks about "w" and '_' which assumes charT is char. There's still room for improvement here: the regex grammar in 28.6.12 [re.grammar] says that the class names in the table should always be recognized, implying that e.g. U"digit" should be recognized by regex_traits<char32_t>, but the specification of regex_traits::lookup_classname doesn't cover that, only mentioning char and wchar_t. Maybe the table should not distinguish narrow and wide strings, but should just have one column and add wording to say that regex_traits widens the name as if by using use_facet<ctype<charT>>::widen().

Another possible improvement would be to allow additional implementation-defined extensions in isctype. An implementation is allowed to support additional class names in lookup_classname, e.g. [[:octdigit:]] for [0-7] or [[:bindigit:]] for [01], but the current definition of isctype provides no way to use them unless ctype_base::mask also supports them.

[2011-05-10: Alberto and Daniel perform minor fixes in the P/R]

[ 2011 Bloomington ]

Consensus that this looks to be a correct solution, and the presentation as a table is a big improvement.

Concern that the middle section wording is a little muddled and confusing, Stefanus volunteered to reword.

[ 2013-09 Chicago ]

Stefanus provides improved wording (replaced below)

[ 2013-09 Chicago ]

Move as Immediate after reviewing Stefanus's revised wording, apply the new wording to the Working Paper.

Proposed resolution:

This wording is relative to the FDIS.

  1. Modify 28.6.6 [re.traits] p. 10 as indicated:

    template <class ForwardIterator>
      char_class_type lookup_classname(
        ForwardIterator first, ForwardIterator last, bool icase = false) const;
    

    -9- Returns: an unspecified value that represents the character classification named by the character sequence designated by the iterator range [first,last). If the parameter icase is true then the returned mask identifies the character classification without regard to the case of the characters being matched, otherwise it does honor the case of the characters being matched.(footnote 335) The value returned shall be independent of the case of the characters in the character sequence. If the name is not recognized then returns a value that compares equal to 0.

    -10- Remarks: For regex_traits<char>, at least the names "d", "w", "s", "alnum", "alpha", "blank", "cntrl", "digit", "graph", "lower", "print", "punct", "space", "upper" and "xdigit"narrow character names in Table X shall be recognized. For regex_traits<wchar_t>, at least the names L"d", L"w", L"s", L"alnum", L"alpha", L"blank", L"cntrl", L"digit", L"graph", L"lower", L"print", L"punct", L"space", L"upper" and L"xdigit"wide character names in Table X shall be recognized.

  2. Modify 28.6.6 [re.traits] p. 12 as indicated:

    bool isctype(charT c, char_class_type f) const;
    

    -11- Effects: Determines if the character c is a member of the character classification represented by f.

    -12- Returns: Converts f into a value m of type std::ctype_base::mask in an unspecified manner, and returns true if use_facet<ctype<charT> >(getloc()).is(m, c) is true. Otherwise returns true if f bitwise or'ed with the result of calling lookup_classname with an iterator pair that designates the character sequence "w" is not equal to 0 and c == '_', or if f bitwise or'ed with the result of calling lookup_classname with an iterator pair that designates the character sequence "blank" is not equal to 0 and c is one of an implementation-defined subset of the characters for which isspace(c, getloc()) returns true, otherwise returns false. Given an exposition-only function prototype

    
      template<class C>
       ctype_base::mask convert(typename regex_traits<C>::char_class_type f);
    
    

    that returns a value in which each ctype_base::mask value corresponding to a value in f named in Table X is set, then the result is determined as if by:

    
    ctype_base::mask m = convert<charT>(f);
    const ctype<charT>& ct = use_facet<ctype<charT>>(getloc());
    if (ct.is(m, c)) {
      return true;
    } else if (c == ct.widen('_')) {
      charT w[1] = { ct.widen('w') };
      char_class_type x = lookup_classname(w, w+1);
      
      return (f&x) == x;
    } else {
      return false;
    } 
    
    

    [Example:

    
    regex_traits<char> t;
    string d("d");
    string u("upper");
    regex_traits<char>::char_class_type f;
    f = t.lookup_classname(d.begin(), d.end());
    f |= t.lookup_classname(u.begin(), u.end());
    ctype_base::mask m = convert<char>(f); // m == ctype_base::digit|ctype_base::upper
    

    end example]

    [Example:

    
    regex_traits<char> t;
    string w("w");
    regex_traits<char>::char_class_type f;
    f = t.lookup_classname(w.begin(), w.end());
    t.isctype('A', f); // returns true
    t.isctype('_', f); // returns true
    t.isctype(' ', f); // returns false
    

    end example]

  3. At the end of 28.6.6 [re.traits] add a new "Table X — Character class names and corresponding ctype masks":

    Table X — Character class names and corresponding ctype masks
    Narrow character name Wide character name Corresponding ctype_base::mask value
    "alnum" L"alnum" ctype_base::alnum
    "alpha" L"alpha" ctype_base::alpha
    "blank" L"blank" ctype_base::blank
    "cntrl" L"cntrl" ctype_base::cntrl
    "digit" L"digit" ctype_base::digit
    "d" L"d" ctype_base::digit
    "graph" L"graph" ctype_base::graph
    "lower" L"lower" ctype_base::lower
    "print" L"print" ctype_base::print
    "punct" L"punct" ctype_base::punct
    "space" L"space" ctype_base::space
    "s" L"s" ctype_base::space
    "upper" L"upper" ctype_base::upper
    "w" L"w" ctype_base::alnum
    "xdigit" L"xdigit" ctype_base::xdigit