4044. Confusing requirements for std::print on POSIX platforms

Section: 31.7.10 [print.fun] Status: New Submitter: Jonathan Wakely Opened: 2024-01-24 Last modified: 2024-01-24 20:20:08 UTC

Priority: Not Prioritized

Discussion:

The effects for vprint_unicode say:

If stream refers to a terminal capable of displaying Unicode, writes out to the terminal using the native Unicode API; if out contains invalid code units, the behavior is undefined and implementations are encouraged to diagnose it. Otherwise writes out to stream unchanged. If the native Unicode API is used, the function flushes stream before writing out.

[Note 1: On POSIX and Windows, stream referring to a terminal means that, respectively, isatty(fileno(stream)) and GetConsoleMode(_get_osfhandle(_fileno(stream)), ...) return nonzero. — end note]

[Note 2: On Windows, the native Unicode API is WriteConsoleW. — end note]

-8- Throws: [...]

-9- Recommended practice: If invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD replacement character per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion.

The very explicit mention of isatty for POSIX platforms has confused at least two implementers into thinking that we're supposed to use isatty, and supposed to do something differently based on what it returns. That seems consistent with the nearly identical wording in 22.14.2.2 [format.string.std] paragraph 12, which says "Implementations should use either UTF-8, UTF-16, or UTF-32, on platforms capable of displaying Unicode text in a terminal" and then has a note explicitly saying this is the case for Windows-based and many POSIX-based operating systems. So it seems clear that POSIX platforms are supposed to be considered to have "a terminal capable of displaying Unicode text", and so std::print should use isatty and then use a native Unicode API, and diagnose invalid code units.

This is a problem, however, because isatty needs to make a system call on Linux, adding 500ns to every std::print call. This results in a 10x slowdown on Linux, where std::print can take just 60ns without the isatty check.
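The per-call cost can be estimated with a small timing loop. The following is only a rough sketch for POSIX systems: the helper name average_isatty_ns is hypothetical, and the figure it reports depends on the kernel, the hardware, and what the stream refers to, so it is not expected to reproduce the numbers above exactly.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <unistd.h>  // POSIX: isatty

// Hypothetical helper (not part of any library): average wall-clock cost of
// one isatty() call on the file descriptor behind `stream`, in nanoseconds.
double average_isatty_ns(std::FILE* stream, int iterations) {
    assert(stream != nullptr && iterations > 0);
    const int fd = fileno(stream);  // POSIX function declared in <stdio.h>
    int tty_hits = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        tty_hits += isatty(fd);  // external call, so it is not optimized away
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)tty_hits;  // the result itself is irrelevant to the measurement
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Calling average_isatty_ns(stdout, 1000000) and comparing the result with the time taken by a single buffered write gives a feel for how much the check dominates a short std::print call.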

From discussions with Tom Honermann I learned that the "native Unicode API" wording is only relevant on Windows. This makes sense, because for POSIX platforms, writing to a terminal is done using the usual stdio functions, so there's no need to treat a terminal differently to any other file stream. And substitution of invalid code units with U+FFFD is recommended for Windows because that's what typical modern terminals do on POSIX platforms, so requiring the implementation to do that on Windows gives consistent behaviour. But the implementation doesn't need to do anything to make that happen with a POSIX terminal; it happens anyway. So the isatty check is unnecessary for POSIX platforms, and the note mentioning it just causes confusion and has no benefit.

Secondly, there initially seems to be a contradiction between the "implementations are encouraged to diagnose it" wording and the later Recommended practice. In fact, there's no contradiction because the native Unicode API might accept UTF-8 and therefore require no transcoding, and so the Recommended practice wouldn't apply. The intention is that diagnosing invalid UTF-8 is still desirable in this case, but how should it be diagnosed? By writing an error to the terminal alongside the formatted string? Or by substituting U+FFFD maybe? If the latter is the intention, why is one suggestion in the middle of the Effects, and one given as Recommended practice?
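To make the substitution option concrete, here is a minimal sketch of U+FFFD substitution while decoding UTF-8. The function name decode_utf8_lossy is hypothetical. It replaces each maximal ill-formed subsequence with one U+FFFD, which is my reading of the practice described in Chapter 3.9 of the Unicode Standard; the wording under discussion does not mandate this particular algorithm.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decodes UTF-8 to UTF-32, replacing every maximal
// ill-formed subsequence with a single U+FFFD.
std::u32string decode_utf8_lossy(std::string_view in) {
    constexpr char32_t replacement = 0xFFFD;
    std::u32string result;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        if (lead < 0x80) {  // ASCII
            result.push_back(lead);
            continue;
        }
        int trailing = 0;                    // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;  // range allowed for the next byte
        char32_t cp = 0;
        if (lead >= 0xC2 && lead <= 0xDF) {
            trailing = 1; cp = lead & 0x1F;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            trailing = 2; cp = lead & 0x0F;
            if (lead == 0xE0) lo = 0xA0;  // rejects overlong encodings
            if (lead == 0xED) hi = 0x9F;  // rejects surrogate code points
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            trailing = 3; cp = lead & 0x07;
            if (lead == 0xF0) lo = 0x90;  // rejects overlong encodings
            if (lead == 0xF4) hi = 0x8F;  // rejects values above U+10FFFF
        } else {  // stray continuation byte, or 0xC0, 0xC1, 0xF5..0xFF
            result.push_back(replacement);
            continue;
        }
        bool ok = true;
        for (int k = 0; k < trailing; ++k) {
            // On failure the offending byte is not consumed, so the outer
            // loop re-examines it as a possible lead byte.
            if (i >= in.size()) { ok = false; break; }
            const unsigned char b = static_cast<unsigned char>(in[i]);
            if (b < lo || b > hi) { ok = false; break; }
            cp = (cp << 6) | (b & 0x3F);
            ++i;
            lo = 0x80; hi = 0xBF;  // later continuation bytes use the full range
        }
        result.push_back(ok ? cp : replacement);
    }
    return result;
}
```

Under this approach "diagnosing" and "substituting" coincide: the invalid input becomes visible in the output as replacement characters, with no separate error channel needed.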

The proposed resolution attempts to clarify that a "native Unicode API" is only needed if that's how you display Unicode on the terminal. It also moves the flushing requirement to be adjacent to the other requirements for systems using a native Unicode API instead of on its own later in the paragraph. And the suggestion to diagnose invalid code units is moved into the Recommended practice and clarified that it's only relevant if using a native Unicode API. I'm still not entirely happy with encouragement to diagnose invalid code units without giving any clue as to how that should be done. What does it mean to diagnose something at runtime? That's novel for the C++ standard. The way it's currently phrased seems to imply something other than U+FFFD substitution should be done, although that seems the most obvious implementation to me.
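The intended control flow can be sketched as follows. This is an illustration under stated assumptions, not how any particular implementation does it: the helper name write_formatted is hypothetical, it takes the already-formatted string rather than calling vformat, and the Windows branch assumes that MultiByteToWideChar called without MB_ERR_INVALID_CHARS replaces invalid UTF-8 with U+FFFD. It also ignores strings too long for an int length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

#ifdef _WIN32
#include <io.h>
#include <windows.h>
#endif

// Hypothetical helper: writes the already-formatted text `out` to `stream`.
// Returns false if flushing or writing fails.
bool write_formatted(std::FILE* stream, std::string_view out) {
    assert(stream != nullptr);  // Preconditions: a valid output C stream
#ifdef _WIN32
    HANDLE console = reinterpret_cast<HANDLE>(_get_osfhandle(_fileno(stream)));
    DWORD mode = 0;
    if (GetConsoleMode(console, &mode) != 0) {
        // Terminal that needs the native Unicode API: flush first, then
        // transcode UTF-8 to UTF-16 and write with WriteConsoleW.
        if (std::fflush(stream) != 0) return false;
        if (out.empty()) return true;
        const int in_len = static_cast<int>(out.size());
        const int wide_len =
            MultiByteToWideChar(CP_UTF8, 0, out.data(), in_len, nullptr, 0);
        if (wide_len <= 0) return false;
        std::wstring wide(static_cast<std::size_t>(wide_len), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, out.data(), in_len, wide.data(), wide_len);
        DWORD written = 0;
        return WriteConsoleW(console, wide.data(), static_cast<DWORD>(wide_len),
                             &written, nullptr) != 0;
    }
#endif
    // Every other case, including POSIX terminals: write the bytes unchanged
    // through stdio, with no isatty check.
    return std::fwrite(out.data(), 1, out.size(), stream) == out.size();
}
```

On a POSIX build only the final fwrite remains, which is the point of the resolution: the terminal check and the native API path exist solely where the terminal cannot display Unicode through ordinary stdio.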

Proposed resolution:

This wording is relative to N4971.

  1. Modify 31.7.6.3.5 [ostream.formatted.print] as indicated:

    void vprint_unicode(ostream& os, string_view fmt, format_args args);
    void vprint_nonunicode(ostream& os, string_view fmt, format_args args);
    

    -3- Effects: Behaves as a formatted output function (31.7.6.3.1 [ostream.formatted.reqmts]) of os, except that:

    1. (3.1) – failure to generate output is reported as specified below, and
    2. (3.2) – any exception thrown by the call to vformat is propagated without regard to the value of os.exceptions() and without turning on ios_base::badbit in the error state of os.

    After constructing a sentry object, the function initializes an automatic variable via

      string out = vformat(os.getloc(), fmt, args); 
    If the function is vprint_unicode and os is a stream that refers to a terminal capable of displaying Unicode via a native Unicode API, which is determined in an implementation-defined manner, flushes os and then writes out to the terminal using the native Unicode API; if out contains invalid code units, the behavior is undefined. Otherwise (if os is not such a stream or the function is vprint_nonunicode), inserts the character sequence [out.begin(), out.end()) into os. If writing to the terminal or inserting into os fails, calls os.setstate(ios_base::badbit) (which may throw ios_base::failure).

    -4- Recommended practice: For vprint_unicode, if invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD replacement character per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If invoking the native Unicode API does not require transcoding, implementations are encouraged to diagnose invalid code units.

  2. Modify 31.7.10 [print.fun] as indicated:

    void vprint_unicode(FILE* stream, string_view fmt, format_args args);
    

    -6- Preconditions: stream is a valid pointer to an output C stream.

    -7- Effects: The function initializes an automatic variable via

      string out = vformat(fmt, args); 
    If stream refers to a terminal capable of displaying Unicode via a native Unicode API, flushes stream and then writes out to the terminal using the native Unicode API; if out contains invalid code units, the behavior is undefined. Otherwise writes out to stream unchanged.

    [Note 1: On Windows, the native Unicode API is WriteConsoleW and stream referring to a terminal means that GetConsoleMode(_get_osfhandle(_fileno(stream)), ...) returns nonzero. — end note]

    -8- Throws: [...]

    -9- Recommended practice: If invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD replacement character per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If invoking the native Unicode API does not require transcoding, implementations are encouraged to diagnose invalid code units.