regex_traits::isctype Returns clause is wrong
Section: 28.6.6 [re.traits] Status: C++14 Submitter: Jonathan Wakely Opened: 2010-11-16 Last modified: 2016-01-28
Priority: Not Prioritized
Discussion:
Addresses GB 10
28.6.6 [re.traits] p. 12 says:
returns true if f bitwise or'ed with the result of calling lookup_classname with an iterator pair that designates the character sequence "w" is not equal to 0 and c == '_'
If the bitmask value corresponding to "w" has a non-zero value (which it must do) then the bitwise or with any value is also non-zero, and so isctype('_', f) returns true for any f. Obviously this is wrong, since '_' is not in every ctype category.
There's a similar problem with the following phrases discussing the "blank" char class.
[2011-05-06: Jonathan Wakely comments and provides suggested wording]
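DR 2019 added isblank support to <locale> which simplifies the definition of regex_traits::isctype by removing the special case for the "blank" class.
My suggestion for this issue is to add a new table replacing the lists of recognized names in the Remarks clause of regex_traits::lookup_classname.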
I then refer to that table in the Returns clause of regex_traits::isctype to expand on the "in an unspecified manner" wording which is too vague. The conversion can now be described using the "is set" term defined by 16.3.3.3.3 [bitmask.types] and the new table to convey the intended relationship between e.g. [[:digit:]] and ctype_base::digit, which is not actually stated in the FDIS.
The effects of isctype can then most easily be described in code, given an "exposition only" function prototype to do the not-quite-so-unspecified conversion from char_class_type to ctype_base::mask.
The core of LWG 2018 is the "bitwise or'ed" wording which gives the wrong result, always evaluating to true for all values of f. That is replaced by the condition (f&x) == x where x is the result of calling lookup_classname with "w". I believe that's necessary, because the "w" class could be implemented by an internal "underscore" class i.e. x = _Alnum|_Underscore in which case (f&x) != 0 would give the wrong result when f==_Alnum.
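The difference between the three conditions can be checked with a toy bitmask. The sketch below is not taken from any implementation: the enumerators Digit, Alnum and Underscore, their values, and the helper names are assumptions chosen only to mirror the x = _Alnum|_Underscore scenario described above.

```cpp
#include <cassert>

// Hypothetical class bits; real implementations choose their own values.
enum ClassBits : unsigned {
    Digit      = 1u << 0,
    Alnum      = 1u << 1,
    Underscore = 1u << 2
};

// Assumed result of lookup_classname("w") in this toy model.
constexpr unsigned WordClass = Alnum | Underscore;

// FDIS wording: f bitwise or'ed with x is not equal to 0.
constexpr bool fdis_condition(unsigned f, unsigned x) { return (f | x) != 0; }

// Weaker alternative: at least one bit of x is set in f.
constexpr bool any_bit_condition(unsigned f, unsigned x) { return (f & x) != 0; }

// Condition in the proposed resolution: every bit of x is set in f.
constexpr bool all_bits_condition(unsigned f, unsigned x) { return (f & x) == x; }

// Each call answers: would '_' be reported as a member of class f?
inline void check_underscore_membership() {
    assert(fdis_condition(Digit, WordClass));          // wrong: '_' is not a digit
    assert(any_bit_condition(Alnum, WordClass));       // wrong: '_' is not alphanumeric
    assert(!all_bits_condition(Alnum, WordClass));     // right: rejected for alnum
    assert(!all_bits_condition(Digit, WordClass));     // right: rejected for digit
    assert(all_bits_condition(WordClass, WordClass));  // right: accepted for \w
}
```

Only the (f&x) == x form rejects '_' for every class that does not contain all of the bits that make up "w".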
The proposed resolution also makes use of ctype::widen which addresses the problem that the current wording only talks about "w" and '_' which assumes charT is char. There's still room for improvement here: the regex grammar in 28.6.12 [re.grammar] says that the class names in the table should always be recognized, implying that e.g. U"digit" should be recognized by regex_traits<char32_t>, but the specification of regex_traits::lookup_classname doesn't cover that, only mentioning char and wchar_t. Maybe the table should not distinguish narrow and wide strings, but should just have one column and add wording to say that regex_traits widens the name as if by using use_facet<ctype<charT>>::widen().
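A caller can already approximate the single-column idea by widening a narrow class name before looking it up. The following is a minimal sketch, assuming a library that provides ctype<charT> for the character type in use (guaranteed only for char and wchar_t); lookup_widened is a hypothetical helper, not a standard library function.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper: widen a narrow class name using the locale held by
// the traits object, then pass the widened name to lookup_classname.
template <class CharT>
typename std::regex_traits<CharT>::char_class_type
lookup_widened(const std::regex_traits<CharT>& traits, const std::string& name) {
    const std::ctype<CharT>& ct =
        std::use_facet<std::ctype<CharT>>(traits.getloc());
    std::basic_string<CharT> wide(name.size(), CharT());
    if (!name.empty()) {
        // Range overload of widen: converts [low, high) into the output buffer.
        ct.widen(name.data(), name.data() + name.size(), &wide[0]);
    }
    return traits.lookup_classname(wide.begin(), wide.end());
}
```

With this helper, lookup_widened(std::regex_traits<wchar_t>(), "digit") is expected to name the same classification as a lookup of L"digit".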
Another possible improvement would be to allow additional implementation-defined extensions in isctype. An implementation is allowed to support additional class names in lookup_classname, e.g. [[:octdigit:]] for [0-7] or [[:bindigit:]] for [01], but the current definition of isctype provides no way to use them unless ctype_base::mask also supports them.
[2011-05-10: Alberto and Daniel perform minor fixes in the P/R]
[ 2011 Bloomington ]
Consensus that this looks to be a correct solution, and the presentation as a table is a big improvement.
Concern that the middle section wording is a little muddled and confusing, Stefanus volunteered to reword.
[ 2013-09 Chicago ]
Stefanus provides improved wording (replaced below)
[ 2013-09 Chicago ]
Move as Immediate after reviewing Stefanus's revised wording, apply the new wording to the Working Paper.
Proposed resolution:
This wording is relative to the FDIS.
Modify 28.6.6 [re.traits] p. 10 as indicated:
template <class ForwardIterator>
  char_class_type lookup_classname(ForwardIterator first, ForwardIterator last, bool icase = false) const;
-9- Returns: an unspecified value that represents the character classification named by the character sequence designated by the iterator range [first,last). If the parameter icase is true then the returned mask identifies the character classification without regard to the case of the characters being matched, otherwise it does honor the case of the characters being matched.(footnote 335) The value returned shall be independent of the case of the characters in the character sequence. If the name is not recognized then returns a value that compares equal to 0.
-10- Remarks: For regex_traits<char>, at least the [-names "d", "w", "s", "alnum", "alpha", "blank", "cntrl", "digit", "graph", "lower", "print", "punct", "space", "upper" and "xdigit"-] {+narrow character names in Table X+} shall be recognized. For regex_traits<wchar_t>, at least the [-names L"d", L"w", L"s", L"alnum", L"alpha", L"blank", L"cntrl", L"digit", L"graph", L"lower", L"print", L"punct", L"space", L"upper" and L"xdigit"-] {+wide character names in Table X+} shall be recognized.
Modify 28.6.6 [re.traits] p. 12 as indicated:
bool isctype(charT c, char_class_type f) const;
-11- Effects: Determines if the character c is a member of the character classification represented by f.
-12- Returns: [-Converts f into a value m of type std::ctype_base::mask in an unspecified manner, and returns true if use_facet<ctype<charT> >(getloc()).is(m, c) is true. Otherwise returns true if f bitwise or'ed with the result of calling lookup_classname with an iterator pair that designates the character sequence "w" is not equal to 0 and c == '_', or if f bitwise or'ed with the result of calling lookup_classname with an iterator pair that designates the character sequence "blank" is not equal to 0 and c is one of an implementation-defined subset of the characters for which isspace(c, getloc()) returns true, otherwise returns false.-] {+Given an exposition-only function prototype
  template<class C>
    ctype_base::mask convert(typename regex_traits<C>::char_class_type f);
that returns a value in which each ctype_base::mask value corresponding to a value in f named in Table X is set, then the result is determined as if by:
  ctype_base::mask m = convert<charT>(f);
  const ctype<charT>& ct = use_facet<ctype<charT>>(getloc());
  if (ct.is(m, c)) {
    return true;
  } else if (c == ct.widen('_')) {
    charT w[1] = { ct.widen('w') };
    char_class_type x = lookup_classname(w, w+1);
    return (f&x) == x;
  } else {
    return false;
  }
[Example:
  regex_traits<char> t;
  string d("d");
  string u("upper");
  regex_traits<char>::char_class_type f;
  f = t.lookup_classname(d.begin(), d.end());
  f |= t.lookup_classname(u.begin(), u.end());
  ctype_base::mask m = convert<char>(f); // m == ctype_base::digit|ctype_base::upper
— end example]
[Example:
  regex_traits<char> t;
  string w("w");
  regex_traits<char>::char_class_type f;
  f = t.lookup_classname(w.begin(), w.end());
  t.isctype('A', f); // returns true
  t.isctype('_', f); // returns true
  t.isctype(' ', f); // returns false
— end example]+}
At the end of 28.6.6 [re.traits] add a new "Table X — Character class names and corresponding ctype masks":
Table X — Character class names and corresponding ctype masks
Narrow character name   Wide character name   Corresponding ctype_base::mask value
"alnum"                 L"alnum"              ctype_base::alnum
"alpha"                 L"alpha"              ctype_base::alpha
"blank"                 L"blank"              ctype_base::blank
"cntrl"                 L"cntrl"              ctype_base::cntrl
"digit"                 L"digit"              ctype_base::digit
"d"                     L"d"                  ctype_base::digit
"graph"                 L"graph"              ctype_base::graph
"lower"                 L"lower"              ctype_base::lower
"print"                 L"print"              ctype_base::print
"punct"                 L"punct"              ctype_base::punct
"space"                 L"space"              ctype_base::space
"s"                     L"s"                  ctype_base::space
"upper"                 L"upper"              ctype_base::upper
"w"                     L"w"                  ctype_base::alnum
"xdigit"                L"xdigit"             ctype_base::xdigit
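Assuming a standard library that implements the resolved wording (C++14 or later), the second example and a few rows of Table X can be spot-checked through the public regex_traits interface. in_class is a hypothetical helper introduced only for this sketch, and the checks assume the default "C" global locale.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: reports whether c belongs to the character class
// called name, according to std::regex_traits<char> in the global locale.
inline bool in_class(char c, const std::string& name) {
    std::regex_traits<char> t;
    std::regex_traits<char>::char_class_type f =
        t.lookup_classname(name.begin(), name.end());
    return t.isctype(c, f);
}

// Mirrors the isctype example for "w" and contrasts it with "alnum".
inline void check_word_class() {
    assert(in_class('A', "w"));       // alphanumeric characters are in "w"
    assert(in_class('_', "w"));       // the underscore special case
    assert(!in_class(' ', "w"));      // a space is not a word character
    assert(!in_class('_', "alnum"));  // '_' must not leak into other classes
}
```

The last assertion is the behavior the FDIS wording got wrong: with the "bitwise or'ed" condition, '_' would have been accepted for "alnum" as well.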