Issue 2959: char_traits<char16_t>::eof is a valid UTF-16 code unit

This page is a snapshot from the LWG issues list, see the Library Active Issues List for more information and the meaning of New status.

2959. `char_traits<char16_t>::eof` is a valid UTF-16 code unit

Section: 27.2.4.4 [char.traits.specializations.char16.t] Status: New Submitter: Jonathan Wakely Opened: 2017-05-05 Last modified: 2019-04-02

Priority: 3

View all other issues in [char.traits.specializations.char16.t].

View all issues with New status.

Discussion:

The standard requires that char_traits<char16_t>::int_type is uint_least16_t, so when that has the same representation as char16_t there are no bits left to represent the eof value.

27.2.4.4 [char.traits.specializations.char16.t] says:

— The member eof() shall return an implementation-defined constant that cannot appear as a valid UTF-16 code unit.

Existing practice is to use the "noncharacter" u'\uffff' for this value, but the Unicode spec is clear that U+FFFF and other noncharacters are valid, and their appearance in a UTF-16 string does not make it ill-formed. See here and here:

The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.

In practice this means there's no way to tell if basic_streambuf<char16_t>::sputc(u'\uffff') succeeded or not. If it can insert the character it returns to_int_type(u'\uffff') and otherwise it returns eof(), which is the same value.

I believe that char_traits<char16_t>::to_int_type(char_type c) can be defined to transform U+FFFF into U+FFFD, so that the invariant eq_int_type(eof(), to_int_type(c)) == false holds for any c (and the return value of sputc will be distinct from eof). I don't think any implementation currently meets that invariant.

I think at the very least we need to correct the statement "The member eof() shall return an implementation-defined constant that cannot appear as a valid UTF-16 code unit", because there are no such constants if sizeof(uint_least16_t) == sizeof(char16_t).

This issue is closely related to LWG 1200, but there it's a slightly different statement of the problem, and neither the submitter's recommendation nor the proposed resolution solves this issue here. It seems that was closed as NAD before the Unicode corrigendum existed, so at the time our standard just gave "surprising results" but wasn't strictly wrong. Now it makes a normative statement that conflicts with Unicode.

[2017-07 Toronto Wed Issue Prioritization]

Priority 3

Proposed resolution:

2959. char_traits<char16_t>::eof is a valid UTF-16 code unit

2959. `char_traits<char16_t>::eof` is a valid UTF-16 code unit