Issue 4044: Confusing requirements for std::print on POSIX platforms

4044. Confusing requirements for `std::print` on POSIX platforms

Section: 31.7.10 [print.fun] Status: WP Submitter: Jonathan Wakely Opened: 2024-01-24 Last modified: 2024-11-28

Priority: 3


Discussion:

The effects for vprint_unicode say:

If stream refers to a terminal capable of displaying Unicode, writes out to the terminal using the native Unicode API; if out contains invalid code units, the behavior is undefined and implementations are encouraged to diagnose it. Otherwise writes out to stream unchanged. If the native Unicode API is used, the function flushes stream before writing out.

[Note 1: On POSIX and Windows, stream referring to a terminal means that, respectively, isatty(fileno(stream)) and GetConsoleMode(_get_osfhandle(_fileno(stream)), ...) return nonzero. — end note]

[Note 2: On Windows, the native Unicode API is WriteConsoleW. — end note]

-8- Throws: [...]

-9- Recommended practice: If invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion.

The very explicit mention of isatty for POSIX platforms has confused at least two implementers into thinking that we're supposed to use isatty, and supposed to do something differently based on what it returns. That seems consistent with the nearly identical wording in 28.5.2.2 [format.string.std] paragraph 12, which says "Implementations should use either UTF-8, UTF-16, or UTF-32, on platforms capable of displaying Unicode text in a terminal" and then has a note explicitly saying this is the case for Windows-based and many POSIX-based operating systems. So it seems clear that POSIX platforms are supposed to be considered to have "a terminal capable of displaying Unicode text", and so std::print should use isatty and then use a native Unicode API, and diagnose invalid code units.

This is a problem however, because isatty needs to make a system call on Linux, adding 500ns to every std::print call. This results in a 10x slowdown on Linux, where std::print can take just 60ns without the isatty check.

From discussions with Tom Honermann I learned that the "native Unicode API" wording is only relevant on Windows. This makes sense, because for POSIX platforms, writing to a terminal is done using the usual stdio functions, so there's no need to treat a terminal differently to any other file stream. And substitution of invalid code units with U+FFFD is recommended for Windows because that's what typical modern terminals do on POSIX platforms, so requiring the implementation to do that on Windows gives consistent behaviour. But the implementation doesn't need to do anything to make that happen with a POSIX terminal; it happens anyway. So the isatty check is unnecessary for POSIX platforms, and the note mentioning it just causes confusion and has no benefit.
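As a rough illustration of the split described above, the following sketch writes already-formatted text to a C stream and takes the console path only on Windows. The name `write_formatted` is hypothetical, and the Windows branch is an untested assumption about how `GetConsoleMode`, `MultiByteToWideChar`, and `WriteConsoleW` could be combined; it is not taken from any shipping implementation. On other platforms the function never calls isatty and simply uses stdio.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <string_view>
#ifdef _WIN32
#  include <io.h>
#  include <windows.h>
#endif

// Hypothetical helper, not a standard function: writes formatted text `out`
// to `stream`. Returns true if all of the text was written.
bool write_formatted(std::FILE* stream, std::string_view out) {
    assert(stream != nullptr);  // mirrors the Preconditions paragraph
    if (out.empty()) return true;
#ifdef _WIN32
    // Assumed Windows path: only a console handle needs the native API.
    HANDLE h = reinterpret_cast<HANDLE>(_get_osfhandle(_fileno(stream)));
    DWORD mode = 0;
    if (h != INVALID_HANDLE_VALUE && GetConsoleMode(h, &mode)) {
        std::fflush(stream);  // flush before bypassing the stream buffer
        // Transcode UTF-8 to UTF-16. Without MB_ERR_INVALID_CHARS, invalid
        // input is expected to be replaced rather than rejected.
        const int len = static_cast<int>(out.size());
        const int n = MultiByteToWideChar(CP_UTF8, 0, out.data(), len, nullptr, 0);
        if (n <= 0) return false;
        std::wstring wide(static_cast<std::size_t>(n), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, out.data(), len, wide.data(), n);
        DWORD written = 0;
        return WriteConsoleW(h, wide.data(), static_cast<DWORD>(n), &written, nullptr) != 0;
    }
#endif
    // Everywhere else, and for non-console streams on Windows: write the
    // bytes unchanged, with no terminal check at all.
    return std::fwrite(out.data(), 1, out.size(), stream) == out.size();
}
```

A caller would pass the result of `vformat`, for example `write_formatted(stdout, out)`.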

Secondly, there initially seems to be a contradiction between the "implementations are encouraged to diagnose it" wording and the later Recommended practice. In fact, there's no contradiction because the native Unicode API might accept UTF-8 and therefore require no transcoding, and so the Recommended practice wouldn't apply. The intention is that diagnosing invalid UTF-8 is still desirable in this case, but how should it be diagnosed? By writing an error to the terminal alongside the formatted string? Or by substituting U+FFFD maybe? If the latter is the intention, why is one suggestion in the middle of the Effects, and one given as Recommended practice?

The proposed resolution attempts to clarify that a "native Unicode API" is only needed if that's how you display Unicode on the terminal. It also moves the flushing requirement to be adjacent to the other requirements for systems using a native Unicode API instead of on its own later in the paragraph. And the suggestion to diagnose invalid code units is moved into the Recommended practice and clarified that it's only relevant if using a native Unicode API. I'm still not entirely happy with encouragement to diagnose invalid code units without giving any clue as to how that should be done. What does it mean to diagnose something at runtime? That's novel for the C++ standard. The way it's currently phrased seems to imply something other than U+FFFD substitution should be done, although that seems the most obvious implementation to me.
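To make the U+FFFD substitution concrete, here is a minimal sketch of transcoding UTF-8 to UTF-16 while replacing ill-formed input. It assumes the "maximal subpart" reading of the Unicode Standard's guidance, where each longest invalid prefix of a sequence becomes one U+FFFD. The function name `to_utf16_lossy` is hypothetical, and nothing here is claimed to match what a particular standard library or native API does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: transcodes UTF-8 to UTF-16, emitting U+FFFD for each
// maximal ill-formed subsequence instead of failing.
std::u16string to_utf16_lossy(std::string_view in) {
    constexpr char16_t replacement = u'\xFFFD';
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i++]);
        if (b0 < 0x80) { out.push_back(static_cast<char16_t>(b0)); continue; }

        int extra = 0;                 // continuation bytes still expected
        unsigned lo = 0x80, hi = 0xBF; // allowed range for the next byte
        char32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) { extra = 1; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) {
            extra = 2; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0; // rejects overlong encodings
            if (b0 == 0xED) hi = 0x9F; // rejects surrogate code points
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            extra = 3; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90; // rejects overlong encodings
            if (b0 == 0xF4) hi = 0x8F; // rejects values above U+10FFFF
        } else { out.push_back(replacement); continue; } // invalid lead byte

        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= in.size()) { ok = false; break; }       // truncated
            const unsigned char b = static_cast<unsigned char>(in[i]);
            if (b < lo || b > hi) { ok = false; break; }     // resume at b
            cp = (cp << 6) | (b & 0x3F);
            ++i;
            lo = 0x80; hi = 0xBF; // only the second byte has a special range
        }
        if (!ok) { out.push_back(replacement); continue; }

        assert(cp <= 0x10FFFF);
        if (cp >= 0x10000) {      // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

With this approach a stray byte such as 0xFF between two ASCII characters yields a single U+FFFD between them, and a truncated multi-byte sequence at the end of the input yields one U+FFFD rather than one per byte.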

[2024-03-12; Reflector poll]

Set priority to 3 after reflector poll and send to SG16.

Previous resolution [SUPERSEDED]:

This wording is relative to N4971.
Modify 31.7.6.3.5 [ostream.formatted.print] as indicated:
void vprint_unicode(ostream& os, string_view fmt, format_args args);
void vprint_nonunicode(ostream& os, string_view fmt, format_args args);
-3- Effects: Behaves as a formatted output function (31.7.6.3.1 [ostream.formatted.reqmts]) of os, except that:

(3.1) – failure to generate output is reported as specified below, and

(3.2) – any exception thrown by the call to vformat is propagated without regard to the value of os.exceptions() and without turning on ios_base::badbit in the error state of os.

After constructing a sentry object, the function initializes an automatic variable via
  string out = vformat(os.getloc(), fmt, args); 
If the function is vprint_unicode and os is a stream that refers to a terminal capable of displaying Unicode via a native Unicode API, which is determined in an implementation-defined manner, flushes os and then writes out to the terminal using the native Unicode API; if out contains invalid code units, the behavior is undefined ~~and implementations are encouraged to diagnose it~~. ~~If the native Unicode API is used, the function flushes os before writing out.~~ Otherwise, (if os is not such a stream or the function is vprint_nonunicode), inserts the character sequence [out.begin(),out.end()) into os. If writing to the terminal or inserting into os fails, calls os.setstate(ios_base::badbit) (which may throw ios_base::failure).

-4- Recommended practice: For vprint_unicode, if invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If invoking the native Unicode API does not require transcoding, implementations are encouraged to diagnose invalid code units.
Modify 31.7.10 [print.fun] as indicated:
void vprint_unicode(FILE* stream, string_view fmt, format_args args);
-6- Preconditions: stream is a valid pointer to an output C stream.

-7- Effects: The function initializes an automatic variable via
  string out = vformat(fmt, args); 
If stream refers to a terminal capable of displaying Unicode via a native Unicode API, flushes stream and then writes out to the terminal using the native Unicode API; if out contains invalid code units, the behavior is undefined ~~and implementations are encouraged to diagnose it~~. Otherwise writes out to stream unchanged. ~~If the native Unicode API is used, the function flushes stream before writing out.~~

[Note 1: On ~~POSIX and~~ Windows, the native Unicode API is WriteConsoleW and stream referring to a terminal means that~~, respectively, isatty(fileno(stream)) and~~ GetConsoleMode(_get_osfhandle(_fileno(stream)), ...) return nonzero. — end note]

~~[Note 2: On Windows, the native Unicode API is WriteConsoleW. — end note]~~

-8- Throws: [...]

-9- Recommended practice: If invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. If invoking the native Unicode API does not require transcoding, implementations are encouraged to diagnose invalid code units.

[2024-03-12; Jonathan updates wording based on SG16 feedback]

SG16 reviewed the issue and approved the proposed resolution with the wording about diagnosing invalid code units removed.

SG16 favors removing the following text (both occurrences) from the proposed wording. This is motivated by a lack of understanding regarding what it means to diagnose such invalid code unit sequences given that the input is likely provided at run-time.

If invoking the native Unicode API does not require transcoding, implementations are encouraged to diagnose invalid code units.

Some concern was expressed regarding how the current wording is structured. At present, the wording leads with a Windows-centric perspective: if the stream refers to a terminal ... use the native Unicode API ... otherwise write code units to the stream. It might be an improvement to structure the wording such that use of the native Unicode API is presented as a fallback for implementations that require its use when writing directly to the stream is not sufficient to produce desired results. In other words, the wording should permit writing directly to the stream, even when the stream is directed to a terminal and a native Unicode API is available, if the implementation has reason to believe that doing so will produce the correct results. For example, Microsoft's HoloLens has a Windows-based operating system, but it only supports use of UTF-8 as the system code page and therefore would not require the native Unicode API bypass; implementations for it could avoid the overhead of checking to see if the stream is directed to a console.

Previous resolution [SUPERSEDED]:

This wording is relative to N4971.
Modify 31.7.6.3.5 [ostream.formatted.print] as indicated:
void vprint_unicode(ostream& os, string_view fmt, format_args args);
void vprint_nonunicode(ostream& os, string_view fmt, format_args args);
-3- Effects: Behaves as a formatted output function (31.7.6.3.1 [ostream.formatted.reqmts]) of os, except that:

(3.1) – failure to generate output is reported as specified below, and

(3.2) – any exception thrown by the call to vformat is propagated without regard to the value of os.exceptions() and without turning on ios_base::badbit in the error state of os.

After constructing a sentry object, the function initializes an automatic variable via
  string out = vformat(os.getloc(), fmt, args); 
If the function is vprint_unicode and os is a stream that refers to a terminal that is only capable of displaying Unicode via a native Unicode API, which is determined in an implementation-defined manner, flushes os and then writes out to the terminal using the native Unicode API; if out contains invalid code units, the behavior is undefined ~~and implementations are encouraged to diagnose it~~. ~~If the native Unicode API is used, the function flushes os before writing out.~~ Otherwise, (if os is not such a stream or the function is vprint_nonunicode), inserts the character sequence [out.begin(),out.end()) into os. If writing to the terminal or inserting into os fails, calls os.setstate(ios_base::badbit) (which may throw ios_base::failure).

-4- Recommended practice: For vprint_unicode, if invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion.
Modify 31.7.10 [print.fun] as indicated:
void vprint_unicode(FILE* stream, string_view fmt, format_args args);
-6- Preconditions: stream is a valid pointer to an output C stream.

-7- Effects: The function initializes an automatic variable via
  string out = vformat(fmt, args); 
If stream refers to a terminal that is only capable of displaying Unicode via a native Unicode API, flushes stream and then writes out to the terminal using the native Unicode API; if out contains invalid code units, the behavior is undefined ~~and implementations are encouraged to diagnose it~~. Otherwise writes out to stream unchanged. ~~If the native Unicode API is used, the function flushes stream before writing out.~~

[Note 1: On ~~POSIX and~~ Windows, the native Unicode API is WriteConsoleW and stream referring to a terminal means that~~, respectively, isatty(fileno(stream)) and~~ GetConsoleMode(_get_osfhandle(_fileno(stream)), ...) return nonzero. — end note]

~~[Note 2: On Windows, the native Unicode API is WriteConsoleW. — end note]~~

-8- Throws: [...]

-9- Recommended practice: If invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion.

[2024-03-19; Tokyo: Jonathan updates wording after LWG review]

Split the Effects: into separate bullets for the "native Unicode API" and "otherwise" cases. Remove the now-redundant "if os is not such a stream" parenthesis.

[St. Louis 2024-06-24; move to Ready.]

[Wrocław 2024-11-23; Status changed: Voting → WP.]

Proposed resolution:

This wording is relative to N4971.

Modify 31.7.6.3.5 [ostream.formatted.print] as indicated:
```
void vprint_unicode(ostream& os, string_view fmt, format_args args);
void vprint_nonunicode(ostream& os, string_view fmt, format_args args);
```
-3- Effects: Behaves as a formatted output function (31.7.6.3.1 [ostream.formatted.reqmts]) of os, except that:

(3.1) – failure to generate output is reported as specified below, and

(3.2) – any exception thrown by the call to vformat is propagated without regard to the value of os.exceptions() and without turning on ios_base::badbit in the error state of os.

-?- After constructing a sentry object, the function initializes an automatic variable via
```
  string out = vformat(os.getloc(), fmt, args); 
```
(?.1) – If the function is vprint_unicode and os is a stream that refers to a terminal that is only capable of displaying Unicode via a native Unicode API, which is determined in an implementation-defined manner, flushes os and then writes out to the terminal using the native Unicode API; if out contains invalid code units, the behavior is undefined ~~and implementations are encouraged to diagnose it~~. ~~If the native Unicode API is used, the function flushes os before writing out.~~

(?.2) – Otherwise, ~~(if os is not such a stream or the function is vprint_nonunicode),~~ inserts the character sequence [out.begin(),out.end()) into os.

-?- If writing to the terminal or inserting into os fails, calls os.setstate(ios_base::badbit) (which may throw ios_base::failure).

-4- Recommended practice: For vprint_unicode, if invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion.
Modify 31.7.10 [print.fun] as indicated:
```
void vprint_unicode(FILE* stream, string_view fmt, format_args args);
```
-6- Preconditions: stream is a valid pointer to an output C stream.

-7- Effects: The function initializes an automatic variable via
```
  string out = vformat(fmt, args); 
```
(7.1) – If stream refers to a terminal that is only capable of displaying Unicode via a native Unicode API, flushes stream and then writes out to the terminal using the native Unicode API; if out contains invalid code units, the behavior is undefined ~~and implementations are encouraged to diagnose it~~.

(7.2) – Otherwise writes out to stream unchanged.

~~If the native Unicode API is used, the function flushes stream before writing out.~~

[Note 1: On ~~POSIX and~~ Windows, the native Unicode API is WriteConsoleW and stream referring to a terminal means that~~, respectively, isatty(fileno(stream)) and~~ GetConsoleMode(_get_osfhandle(_fileno(stream)), ...) returns nonzero. — end note]

~~[Note 2: On Windows, the native Unicode API is WriteConsoleW. — end note]~~

-8- Throws: [...]

-9- Recommended practice: If invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion.
