4070. Transcoding by std::formatter<std::filesystem::path>

Section: 31.12.6.9.2 [fs.path.fmtr.funcs] Status: SG16 Submitter: Jonathan Wakely Opened: 2024-04-19 Last modified: 2024-05-08

Priority: 2

View all issues with SG16 status.

Discussion:

31.12.6.9.2 [fs.path.fmtr.funcs] says:

If charT is char, path::value_type is wchar_t, and the literal encoding is UTF-8, then the escaped path is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with u+fffd replacement character per the Unicode Standard [...]. Otherwise, transcoding is implementation-defined.

This seems to mean that the Unicode substitutions are only done for an escaped path, i.e. when the ? option is used. Otherwise, the form of transcoding is completely implementation-defined. However, this makes no sense. An escaped string will have no ill-formed subsequences, because they will already have been replaced as per 28.5.6.5 [format.string.escaped]:

Otherwise (X is a sequence of ill-formed code units), each code unit U is appended to E in order as the sequence \x{hex-digit-sequence}, where hex-digit-sequence is the shortest hexadecimal representation of U using lower-case hexadecimal digits.

So only unescaped strings can have ill-formed sequences by the time we do transcoding to char, but whether or not any u+fffd substitution occurs is just implementation-defined.

I believe we want to specify the substitutions are done when transcoding an unescaped path (and it doesn't matter whether we specify it for escaped paths, because it's a no-op if escaping happens first, as is apparently intended).

It does matter whether we escape first or perform substitutions first. If we escape first then every code unit in an ill-formed sequence is individually escaped as \x{hex-digit-sequence}. So an ill-formed sequence of two wchar_t values will be escaped as two \x{...} strings, which are then transcoded to UTF-8. If we transcode (with substitutions first) then the entire ill-formed sequence is replaced with a single replacement character, which will then be escaped as \x{fffd}. SG16 should be asked to confirm that escaping first is intended, so that an escaped string shows the original invalid code units. For a non-escaped string, we want the ill-formed sequence to be formatted as �, which the proposed resolution tries to ensure.

[2024-05-08; Reflector poll]

Set priority to 2 after reflector poll.

Proposed resolution:

This wording is relative to N4981.

  1. Modify 31.12.6.9.2 [fs.path.fmtr.funcs] as indicated:

    
    template<class FormatContext>
      typename FormatContext::iterator
        format(const filesystem::path& p, FormatContext& ctx) const;
    
    -5- Effects: Let s be p.generic_string<filesystem::path::value_type>() if the g option is used, otherwise p.native(). Writes s into ctx.out(), adjusted according to the path-format-spec. If charT is char, path::value_type is wchar_t, and the literal encoding is UTF-8, then the escaped path (possibly escaped) string is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with u+fffd replacement character per the Unicode Standard, Chapter 3.9 u+fffd Substitution in Conversion. If charT and path::value_type are the same then no transcoding is performed. Otherwise, transcoding is implementation-defined.
  2. Modify the entry in the index of implementation-defined behavior as indicated:
    transcoding of a formatted path when charT and path::value_type differ and not converting from wchar_t to UTF-8