codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale
Section: 28.3.3.1.2.1 [locale.category], 28.3.4.2.5.1 [locale.codecvt.general] Status: WP Submitter: Victor Zverovich Opened: 2022-09-05 Last modified: 2024-04-02
Priority: 3
Discussion:
Table [tab:locale.category.facets] includes the following two facets:
codecvt<char16_t, char8_t, mbstate_t>
codecvt<char32_t, char8_t, mbstate_t>
However, neither of those actually has anything to do with a locale and therefore
it doesn't make sense to dynamically register them with std::locale.
Instead they provide conversions between fixed encodings (UTF-8, UTF-16, UTF-32)
that are unrelated to locale encodings other than they may happen to coincide with
encodings of some locales by accident.
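The locale independence described above can be checked with a minimal sketch. It assumes the older codecvt<char16_t, char, mbstate_t> spelling (deprecated since C++20) so that it builds without char8_t support, and the helper name utf16_to_utf8 is hypothetical. The facet is fetched from whatever locale is passed in, yet the output is always UTF-8:

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-16 to UTF-8 using the facet held by `loc`.
std::string utf16_to_utf8(const std::u16string& in, const std::locale& loc) {
    using facet_type = std::codecvt<char16_t, char, std::mbstate_t>;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (in.empty()) return std::string();

    // One UTF-16 code unit never needs more than 3 UTF-8 bytes.
    std::string out(in.size() * 3, '\0');
    std::mbstate_t state = std::mbstate_t();
    const char16_t* from_next = nullptr;
    char* to_next = nullptr;

    // out() converts from the internal type (char16_t) to the external type (char).
    const std::codecvt_base::result r =
        cvt.out(state, in.data(), in.data() + in.size(), from_next,
                &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("invalid UTF-16 input");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Calling utf16_to_utf8(text, std::locale::classic()) and utf16_to_utf8(text, std::locale()) should give the same bytes, since the locale argument only serves as a container for the facet.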
The facets were introduced as codecvt<char[16|32]_t, char, mbstate_t> in N2035, which gave no design rationale for using codecvt in the first place. Likely it was trying to do a minimal amount of changes and copied the wording for codecvt<wchar_t, char, mbstate_t> but unfortunately didn't consider encoding implications.
P0482 changed char to char8_t in these facets, which made the issue more glaring but unfortunately, despite the breaking change, it failed to address it.
Apart from being an obvious design mistake, this also adds a small overhead to every locale
construction because the implementation has to copy these pseudo-facets for no good
reason, violating the "don't pay for what you don't use" principle.
A simple fix is to remove the two facets from table [tab:locale.category.facets] and make them
directly constructible.
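For context, codecvt declares its destructor protected, so a specialization cannot be created as a plain object. A common workaround, sketched below, is a small derived class with a public destructor. The names deletable_facet and utf8_to_utf16 are hypothetical, and the sketch assumes the char-based specialization (deprecated since C++20) so that it builds without char8_t support:

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: exposes a public destructor so the facet can live
// on the stack without being installed in any locale.
template <class Facet>
struct deletable_facet : Facet {
    using Facet::Facet;
    ~deletable_facet() {}
};

// Hypothetical helper: converts UTF-8 to UTF-16 with no locale involved.
std::u16string utf8_to_utf16(const std::string& in) {
    deletable_facet<std::codecvt<char16_t, char, std::mbstate_t>> cvt;

    if (in.empty()) return std::u16string();

    // UTF-16 never needs more code units than the UTF-8 input has bytes.
    std::u16string out(in.size(), u'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    char16_t* to_next = nullptr;

    // in() converts from the external type (char) to the internal type (char16_t).
    const std::codecvt_base::result r =
        cvt.in(state, in.data(), in.data() + in.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("invalid or incomplete UTF-8 input");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Because the facet object is local to the function, no std::locale is constructed or consulted at any point in the conversion.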
[2022-09-23; Reflector poll]
Set priority to 3 after reflector poll. Send to SG16 (then maybe LEWG).
[2022-09-28; SG16 responds]
SG16 agrees that the codecvt facets mentioned in LWG3767 "codecvt<charN_t, char8_t, mbstate_t> incorrectly added to locale" are intended to be invariant with respect to locale. Unanimously in favor.
[Issaquah 2023-02-10; LWG issue processing]
Removing these breaks most code using them today, because the most obvious
way to use them is via use_facet on a locale, which would throw
if they're removed (and because they were guaranteed to be present, code
using them might have not bothered to check for them using has_facet).
Instead of removing them, deprecate the guarantee that they're always present
(so move them to D.20 [depr.locale.category]).
Don't bother changing the destructor.
Victor to update wording.
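The use_facet failure mode raised in the Issaquah discussion can be sketched as follows. DemoFacet, find_facet and use_facet_throws are hypothetical names introduced for illustration. DemoFacet stands in for a facet that a locale does not contain, which is the situation the codecvt specializations would have been in had they been removed outright:

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <typeinfo>

// Hypothetical facet that no locale contains unless it is installed explicitly.
struct DemoFacet : std::locale::facet {
    static std::locale::id id;
    explicit DemoFacet(std::size_t refs = 0) : std::locale::facet(refs) {}
};
std::locale::id DemoFacet::id;

// Alias for one of the facets discussed in this issue (char-based spelling).
using utf16_codecvt = std::codecvt<char16_t, char, std::mbstate_t>;

// Defensive lookup: returns nullptr instead of throwing when the facet is absent.
// The pointer stays valid only while `loc` or a copy of it is alive.
template <class Facet>
const Facet* find_facet(const std::locale& loc) {
    return std::has_facet<Facet>(loc) ? &std::use_facet<Facet>(loc) : nullptr;
}

// Unchecked lookup: reports whether use_facet threw std::bad_cast.
bool use_facet_throws(const std::locale& loc) {
    try {
        (void)std::use_facet<DemoFacet>(loc);
        return false;
    } catch (const std::bad_cast&) {
        return true;
    }
}
```

Code written against the old guarantee typically calls use_facet directly, as use_facet_throws does, so it would start throwing if the facets disappeared. Only code that goes through a has_facet check, as find_facet does, would degrade gracefully.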
Previous resolution [SUPERSEDED]:
This wording is relative to N4917.
Modify 28.3.3.1.2.1 [locale.category], Table 105 ([tab:locale.category.facets]) — "Locale category facets" — as indicated:
Table 105: Locale category facets [tab:locale.category.facets]
Category    Includes facets
…
ctype       ctype<char>, ctype<wchar_t>
            codecvt<char, char, mbstate_t>
            codecvt<char16_t, char8_t, mbstate_t>
            codecvt<char32_t, char8_t, mbstate_t>
            codecvt<wchar_t, char, mbstate_t>
…
Modify 28.3.4.2.5.1 [locale.codecvt.general] as indicated:
namespace std {
  […]
  template<class internT, class externT, class stateT>
    class codecvt : public locale::facet, public codecvt_base {
    public:
      using intern_type = internT;
      using extern_type = externT;
      using state_type  = stateT;

      explicit codecvt(size_t refs = 0);
      ~codecvt();
      […]
    protected:
      ~codecvt();
      […]
    };
}
[…]
-3- The specializations required in Table 105 [tab:locale.category.facets] 106 [tab:locale.spec] (28.3.3.1.2.1 [locale.category]) convert the implementation-defined native character set. codecvt<char, char, mbstate_t> implements a degenerate conversion; it does not convert at all. The specialization codecvt<char16_t, char8_t, mbstate_t> converts between the UTF-16 and UTF-8 encoding forms, and the specialization codecvt<char32_t, char8_t, mbstate_t> converts between the UTF-32 and UTF-8 encoding forms. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for ordinary and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementer. Other encodings can be converted by specializing on a program-defined stateT type. Objects of type stateT can contain any state that is useful to communicate to or from the specialized do_in or do_out members.
[2023-02-10; Victor Zverovich comments and provides improved wording]
Per today's LWG discussion the following changes have been implemented in revised wording:
Deprecated the facets instead of removing them (also _byname variants which were previously missed).
Removed the changes to facet dtor since with deprecation it's no longer critical to provide other ways to access them.
[Kona 2023-11-07; move to Ready]
[Tokyo 2024-03-23; Status changed: Voting → WP.]
Proposed resolution:
This wording is relative to N4928.
Modify 28.3.3.1.2.1 [locale.category], Table 105 ([tab:locale.category.facets]) — "Locale category facets" — and Table 106 ([tab:locale.spec]) "Required specializations" as indicated:
[…]
Table 105: Locale category facets [tab:locale.category.facets]
Category    Includes facets
…
ctype       ctype<char>, ctype<wchar_t>
            codecvt<char, char, mbstate_t>
            codecvt<char16_t, char8_t, mbstate_t>
            codecvt<char32_t, char8_t, mbstate_t>
            codecvt<wchar_t, char, mbstate_t>
…

Table 106: Required specializations [tab:locale.spec]
Category    Includes facets
…
ctype       ctype_byname<char>, ctype_byname<wchar_t>
            codecvt_byname<char, char, mbstate_t>
            codecvt_byname<char16_t, char8_t, mbstate_t>
            codecvt_byname<char32_t, char8_t, mbstate_t>
            codecvt_byname<wchar_t, char, mbstate_t>
…
Modify 28.3.4.2.5.1 [locale.codecvt.general] as indicated:
[…]
-3- The specializations required in Table 105 (28.3.3.1.2.1 [locale.category]) convert the implementation-defined native character set. codecvt<char, char, mbstate_t> implements a degenerate conversion; it does not convert at all. The specialization codecvt<char16_t, char8_t, mbstate_t> converts between the UTF-16 and UTF-8 encoding forms, and the specialization codecvt<char32_t, char8_t, mbstate_t> converts between the UTF-32 and UTF-8 encoding forms. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for ordinary and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementer. Other encodings can be converted by specializing on a program-defined stateT type. Objects of type stateT can contain any state that is useful to communicate to or from the specialized do_in or do_out members.
Modify D.20 [depr.locale.category] (Deprecated locale category facets) in Annex D as indicated:
-1- The ctype locale category includes the following facets as if they were specified in Table 105 [tab:locale.category.facets] of 28.3.4.2.5.1 [locale.codecvt.general].
    codecvt<char16_t, char, mbstate_t>
    codecvt<char32_t, char, mbstate_t>
    codecvt<char16_t, char8_t, mbstate_t>
    codecvt<char32_t, char8_t, mbstate_t>
-2- The ctype locale category includes the following facets as if they were specified in Table 106 [tab:locale.spec] of 28.3.4.2.5.1 [locale.codecvt.general].
    codecvt_byname<char16_t, char, mbstate_t>
    codecvt_byname<char32_t, char, mbstate_t>
    codecvt_byname<char16_t, char8_t, mbstate_t>
    codecvt_byname<char32_t, char8_t, mbstate_t>
-3- The following class template specializations are required in addition to those specified in 28.3.4.2.5 [locale.codecvt]. The specializations codecvt<char16_t, char, mbstate_t> and codecvt<char16_t, char8_t, mbstate_t> convert between the UTF-16 and UTF-8 encoding forms, and the specializations codecvt<char32_t, char, mbstate_t> and codecvt<char32_t, char8_t, mbstate_t> convert between the UTF-32 and UTF-8 encoding forms.