std::formatter<std::filesystem::path>
Section: 31.12.6.9.2 [fs.path.fmtr.funcs] Status: SG16 Submitter: Jonathan Wakely Opened: 2024-04-19 Last modified: 2024-05-08
Priority: 2
Discussion:
31.12.6.9.2 [fs.path.fmtr.funcs] says:
If charT is char, path::value_type is wchar_t, and the literal encoding is UTF-8, then the escaped path is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard [...]. Otherwise, transcoding is implementation-defined.
This seems to mean that the Unicode substitutions are only done for an escaped path, i.e. when the ? option is used. Otherwise, the form of transcoding is completely implementation-defined.
However, this makes no sense.
An escaped string will have no ill-formed subsequences, because they will
already have been replaced as per 28.5.6.5 [format.string.escaped]:
Otherwise (X is a sequence of ill-formed code units), each code unit U is appended to E in order as the sequence \x{hex-digit-sequence}, where hex-digit-sequence is the shortest hexadecimal representation of U using lower-case hexadecimal digits.
So only unescaped strings can have ill-formed sequences by the time we do transcoding to char, but whether or not any U+FFFD substitution occurs is just implementation-defined.
I believe we want to specify the substitutions are done when transcoding an unescaped path (and it doesn't matter whether we specify it for escaped paths, because it's a no-op if escaping happens first, as is apparently intended).
It does matter whether we escape first or perform substitutions first. If we escape first then every code unit in an ill-formed sequence is individually escaped as \x{hex-digit-sequence}.
So an ill-formed sequence of two wchar_t values will be escaped as two \x{...} strings, which are then transcoded to UTF-8.
If we transcode first (with substitutions) then the entire ill-formed sequence is replaced with a single replacement character, which will then be escaped as \x{fffd}.
SG16 should be asked to confirm that escaping first is intended,
so that an escaped string shows the original invalid code units.
For a non-escaped string, we want the ill-formed sequence to be
formatted as �, which the proposed resolution tries to ensure.
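The difference between the two treatments of ill-formed input can be sketched in a few lines. This is only an illustration, not the specified algorithm: it assumes the wide encoding is UTF-16, uses char16_t in place of wchar_t for portability, handles only unpaired surrogates (none of the other escaping rules in [format.string.escaped]), and treats each unpaired surrogate as its own ill-formed unit. The names append_utf8, IllFormed and to_utf8 are hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Append the UTF-8 encoding of a Unicode scalar value to out.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// How to treat an ill-formed code unit (hypothetical policy names).
enum class IllFormed {
    escape,     // emit \x{hex} for the original code unit (escaped output)
    substitute  // emit U+FFFD (unescaped output)
};

// Transcode UTF-16 to UTF-8, applying the chosen policy to unpaired surrogates.
std::string to_utf8(std::u16string_view in, IllFormed policy) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        const bool low = u >= 0xDC00 && u <= 0xDFFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed surrogate pair: combine into one scalar value.
            const char32_t cp = 0x10000
                + ((static_cast<char32_t>(u) - 0xD800) << 10)
                + (static_cast<char32_t>(in[i + 1]) - 0xDC00);
            append_utf8(out, cp);
            ++i;
        } else if (high || low) {
            // Unpaired surrogate: ill-formed in UTF-16.
            if (policy == IllFormed::escape) {
                char buf[16];
                std::snprintf(buf, sizeof buf, "\\x{%x}", static_cast<unsigned>(u));
                out += buf;
            } else {
                append_utf8(out, char32_t{0xFFFD});
            }
        } else {
            append_utf8(out, u);
        }
    }
    return out;
}
```

Under these assumptions, the input a, 0xD800, b yields a\x{d800}b with IllFormed::escape, which preserves the original invalid code unit, and a�b with IllFormed::substitute.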
[2024-05-08; Reflector poll]
Set priority to 2 after reflector poll.
Proposed resolution:
This wording is relative to N4981.
Modify 31.12.6.9.2 [fs.path.fmtr.funcs] as indicated:
template<class FormatContext> typename FormatContext::iterator format(const filesystem::path& p, FormatContext& ctx) const;
-5- Effects: Let s be p.generic_string<filesystem::path::value_type>() if the g option is used, otherwise p.native(). Writes s into ctx.out(), adjusted according to the path-format-spec. If charT is char, path::value_type is wchar_t, and the literal encoding is UTF-8, then the <del>escaped path</del><ins>(possibly escaped) string</ins> is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9 U+FFFD Substitution in Conversion. <ins>If charT and path::value_type are the same then no transcoding is performed.</ins> Otherwise, transcoding is implementation-defined.
transcoding of a formatted path when charT and path::value_type differ and not converting from wchar_t to UTF-8
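The three cases in the proposed wording can be summarised as a small compile-time dispatch. This is a sketch for illustration only, not how any implementation selects its behaviour; Transcoding and transcoding_rule are hypothetical names, and whether the literal encoding is UTF-8 is passed in as a plain bool rather than detected.

```cpp
#include <type_traits>

// The three outcomes described by the proposed wording (hypothetical names).
enum class Transcoding {
    none,                    // charT and path::value_type are the same
    utf8_with_substitution,  // wchar_t to char, literal encoding is UTF-8
    implementation_defined   // everything else
};

// Select the rule for a given formatter character type and path value type.
template <class CharT, class ValueType>
constexpr Transcoding transcoding_rule(bool literal_encoding_is_utf8) {
    if constexpr (std::is_same_v<CharT, ValueType>) {
        return Transcoding::none;
    } else if constexpr (std::is_same_v<CharT, char> &&
                         std::is_same_v<ValueType, wchar_t>) {
        return literal_encoding_is_utf8 ? Transcoding::utf8_with_substitution
                                        : Transcoding::implementation_defined;
    } else {
        return Transcoding::implementation_defined;
    }
}
```

For example, transcoding_rule<char, wchar_t>(true) selects the U+FFFD substitution rule, while transcoding_rule<wchar_t, char>(true) falls through to the implementation-defined case.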