3965. Incorrect example in [format.string.escaped] p3 for formatting of combining characters

Section: 28.5.6.5 [format.string.escaped] Status: WP Submitter: Tom Honermann Opened: 2023-07-31 Last modified: 2023-11-22

Priority: Not Prioritized

Discussion:

The C++23 DIS contains the following example in 28.5.6.5 [format.string.escaped] p3. (This example does not appear in the most recent N4950 WP or on https://eel.is/c++draft because the project editor has not yet merged changes needed to support rendering of some of the characters involved).

string s6 = format("[{:?}]", "🤷‍♂️"); // s6 has value: ["🤷\u{200d}♂\u{fe0f}"]

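The character to be formatted (🤷‍♂️) consists of the following sequence of code points in the order presented:

  1. U+1F937 (SHRUG)

  2. U+200D (ZERO WIDTH JOINER)

  3. U+2642 (MALE SIGN)

  4. U+FE0F (VARIATION SELECTOR-16)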

28.5.6.5 [format.string.escaped] bullet 2.2.1 specifies which code points are to be formatted as a \u{hex-digit-sequence} escape sequence:

  1. (2.2.1) — If X encodes a single character C, then:

    1. (2.2.1.1) — If C is one of the characters in Table 75 [tab:format.escape.sequences], then the two characters shown as the corresponding escape sequence are appended to E.

    2. (2.2.1.2) — Otherwise, if C is not U+0020 SPACE and

      1. (2.2.1.2.1) — CE is UTF-8, UTF-16, or UTF-32 and C corresponds to a Unicode scalar value whose Unicode property General_Category has a value in the groups Separator (Z) or Other (C), as described by UAX #44 of the Unicode Standard, or

      2. (2.2.1.2.2) — CE is UTF-8, UTF-16, or UTF-32 and C corresponds to a Unicode scalar value with the Unicode property Grapheme_Extend=Yes as described by UAX #44 of the Unicode Standard and C is not immediately preceded in S by a character P appended to E without translation to an escape sequence, or

      3. (2.2.1.2.3) — CE is neither UTF-8, UTF-16, nor UTF-32 and C is one of an implementation-defined set of separator or non-printable characters

      then the sequence \u{hex-digit-sequence} is appended to E, where hex-digit-sequence is the shortest hexadecimal representation of C using lower-case hexadecimal digits.

    3. (2.2.1.3) — Otherwise, C is appended to E.

The example is not consistent with the above specification for the final code point, U+FE0F. It is a single character, is not one of the characters in Table 75, and is not U+0020. Its General_Category is Nonspacing Mark (Mn), which is in neither Z nor C. It has Grapheme_Extend=Yes, but the preceding character (U+2642) is appended without translation to an escape sequence, so bullet 2.2.1.2.2 does not apply. Bullet 2.2.1.2.3 does not apply either, because the example assumes a UTF-8 encoding. Thus, formatting for this character falls to the last bullet (2.2.1.3), and the character should be appended as is (without translation to an escape sequence). Since this character is a combining character, it should combine with the previous character and thus alter the appearance of U+2642 (producing "♂️" instead of "♂\u{fe0f}").
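The reasoning above can be checked mechanically. The following is a minimal sketch, not an implementation of `std::format`: it models only bullets 2.2.1.2.1, 2.2.1.2.2, and 2.2.1.3 for a Unicode encoding and leaves out the Table 75 escapes. The names `is_separator_or_other`, `is_grapheme_extend`, `needs_unicode_escape`, `escape_mask`, and `shrug` are hypothetical, and the two property lookups are hand-written stubs that cover only the four code points of this example.

```cpp
#include <cstddef>

// Hypothetical stand-ins for real Unicode property lookups. They cover only
// the code points of this example; a real implementation would consult the
// Unicode Character Database.
constexpr bool is_separator_or_other(char32_t c) {
    return c == 0x200D;  // ZERO WIDTH JOINER: General_Category Cf, group Other (C)
}

constexpr bool is_grapheme_extend(char32_t c) {
    return c == 0xFE0F;  // VARIATION SELECTOR-16: Grapheme_Extend=Yes
}

// Bullet 2.2.1.2 for a Unicode encoding: is C written as \u{hex-digit-sequence}?
// prev_unescaped is true when the preceding character P was appended to E
// without translation to an escape sequence.
constexpr bool needs_unicode_escape(char32_t c, bool prev_unescaped) {
    if (c == 0x20) return false;                                // U+0020 SPACE is exempt
    if (is_separator_or_other(c)) return true;                  // 2.2.1.2.1
    if (is_grapheme_extend(c) && !prev_unescaped) return true;  // 2.2.1.2.2
    return false;                                               // 2.2.1.3: append as is
}

// Returns a bitmask in which bit i is set when code point i is escaped.
constexpr unsigned escape_mask(const char32_t* s, std::size_t n) {
    unsigned mask = 0;
    bool prev_unescaped = false;  // nothing precedes the first code point
    for (std::size_t i = 0; i < n; ++i) {
        const bool esc = needs_unicode_escape(s[i], prev_unescaped);
        if (esc) mask |= 1u << i;
        prev_unescaped = !esc;
    }
    return mask;
}

// The code points of the example, in order.
constexpr char32_t shrug[] = {0x1F937, 0x200D, 0x2642, 0xFE0F};

// Only U+200D (index 1) is escaped; U+FE0F follows an unescaped U+2642.
static_assert(escape_mask(shrug, 4) == (1u << 1), "only U+200D is escaped");
```

Under these assumptions the mask has only bit 1 set, which matches the corrected example: U+200D is escaped and U+FE0F is appended as is.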

[2023-10-27; Reflector poll]

Set status to Tentatively Ready after six votes in favour during reflector poll.

[2023-11-11 Approved at November 2023 meeting in Kona. Status changed: Voting → WP.]

Proposed resolution:

This wording is relative to N4950 plus missing editorial pieces from P2286R8.

  1. Modify the example following 28.5.6.5 [format.string.escaped] p3 as indicated:

    [Drafting note: The presented example was voted in as part of P2286R8 during the July 2022 Virtual Meeting but is not yet accessible in the most recent working draft N4950.

    Note that the final character (♂️) is composed from the two code points U+2642 and U+FE0F. ]

    - string s6 = format("[{:?}]", "🤷‍♂️"); // s6 has value: ["🤷\u{200d}♂\u{fe0f}"]
    + string s6 = format("[{:?}]", "🤷‍♂️"); // s6 has value: ["🤷\u{200d}♂️"]