char_traits<char16_t>::eof is a valid UTF-16 code unit
Section: 27.2.4.4 [char.traits.specializations.char16.t] Status: New Submitter: Jonathan Wakely Opened: 2017-05-05 Last modified: 2019-04-02
Priority: 3
Discussion:
The standard requires that char_traits<char16_t>::int_type is uint_least16_t, so when that has the same representation as char16_t there are no bits left to represent the eof value.
The standard also says: "The member eof() shall return an implementation-defined constant that cannot appear as a valid UTF-16 code unit."
Existing practice is to use the "noncharacter" u'\uffff' for this value, but the Unicode spec is clear that U+FFFF and other noncharacters are valid, and their appearance in a UTF-16 string does not make it ill-formed. See here and here:
The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.
In practice this means there's no way to tell if basic_streambuf<char16_t>::sputc(u'\uffff') succeeded or not. If it can insert the character it returns to_int_type(u'\uffff') and otherwise it returns eof(), which is the same value.
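The ambiguity can be sketched with a small traits-like type. Everything named here (naive_u16_traits, sputc_result, ambiguous) is hypothetical and written only for illustration; it assumes int_type is a 16-bit uint_least16_t and that eof() is 0xFFFF, as in the existing practice described above. It is not any particular library's implementation.

```cpp
#include <cstdint>

// Hypothetical traits-like type for illustration, not a real library's code.
struct naive_u16_traits {
    using char_type = char16_t;
    using int_type  = std::uint_least16_t;

    // Plain conversion: every code unit maps to its own numeric value.
    static constexpr int_type to_int_type(char_type c) noexcept {
        return static_cast<int_type>(c);
    }
    // Assumed existing practice: reuse the noncharacter U+FFFF as eof.
    static constexpr int_type eof() noexcept {
        return static_cast<int_type>(0xFFFF);
    }
    static constexpr bool eq_int_type(int_type a, int_type b) noexcept {
        return a == b;
    }
};

// Stand-in for what sputc(c) reports: to_int_type(c) on success, eof() on failure.
constexpr naive_u16_traits::int_type sputc_result(char16_t c, bool ok) noexcept {
    return ok ? naive_u16_traits::to_int_type(c) : naive_u16_traits::eof();
}

// True when the success and failure results for c cannot be told apart.
constexpr bool ambiguous(char16_t c) noexcept {
    return naive_u16_traits::eq_int_type(sputc_result(c, true),
                                         sputc_result(c, false));
}

static_assert(ambiguous(u'\uffff'), "U+FFFF is indistinguishable from eof");
static_assert(!ambiguous(u'A'), "other code units are unaffected");
```

Under these assumptions only U+FFFF is affected: for every other code unit the caller can still distinguish success from failure.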
char_traits<char16_t>::to_int_type(char_type c) can be defined to transform U+FFFF into U+FFFD, so that the invariant eq_int_type(eof(), to_int_type(c)) == false holds for any c (and the return value of sputc will be distinct from eof). I don't think any implementation currently meets that invariant.
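A minimal sketch of that remapping follows. The names remapping_u16_traits and invariant_holds_for_all_code_units are hypothetical, and the sketch assumes char16_t holds values 0 through 0xFFFF and that eof() is 0xFFFF.

```cpp
#include <cstdint>

// Hypothetical traits-like type for illustration, not a real library's code.
struct remapping_u16_traits {
    using char_type = char16_t;
    using int_type  = std::uint_least16_t;

    // Assumed eof value, matching the existing practice described above.
    static constexpr int_type eof() noexcept {
        return static_cast<int_type>(0xFFFF);
    }
    // Map U+FFFF to U+FFFD so that no code unit converts to eof().
    static constexpr int_type to_int_type(char_type c) noexcept {
        return c == u'\uffff' ? static_cast<int_type>(0xFFFD)
                              : static_cast<int_type>(c);
    }
    static constexpr bool eq_int_type(int_type a, int_type b) noexcept {
        return a == b;
    }
};

// Checks eq_int_type(eof(), to_int_type(c)) == false for all 65536 code units.
bool invariant_holds_for_all_code_units() {
    for (std::uint_least32_t v = 0; v <= 0xFFFF; ++v) {
        const char16_t c = static_cast<char16_t>(v);
        if (remapping_u16_traits::eq_int_type(
                remapping_u16_traits::eof(),
                remapping_u16_traits::to_int_type(c))) {
            return false;
        }
    }
    return true;
}
```

In this sketch the cost is that U+FFFF and U+FFFD convert to the same int_type value, so the two can no longer be told apart after conversion.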
I think at the very least we need to correct the statement "The member eof() shall return an implementation-defined constant that cannot appear as a valid UTF-16 code unit", because there are no such constants if sizeof(uint_least16_t) == sizeof(char16_t).
This issue is closely related to LWG 1200, but that issue states the problem slightly differently, and neither the submitter's recommendation nor the proposed resolution there solves this issue. It seems LWG 1200 was closed as NAD before the Unicode corrigendum existed, so at the time our standard just gave "surprising results" but wasn't strictly wrong. Now it makes a normative statement that conflicts with Unicode.
[2017-07 Toronto Wed Issue Prioritization]
Priority 3
Proposed resolution: