Section: 28.6.12 [re.grammar] Status: New Submitter: Hubert Tong Opened: 2015-10-08 Last modified: 2024-10-03
Priority: 4
Discussion:
In 28.6.12 [re.grammar] paragraph 2:
basic_regex member functions shall not call any locale dependent C or C++ API, including the formatted string input functions. Instead they shall call the appropriate traits member function to achieve the required effect.
Yet, the required interface for a regular expression traits class (28.6.2 [re.req]) does not appear to have
any reliable method for determining whether a character as encoded for the locale associated with the traits
instance is the same as a character represented by a UnicodeEscapeSequence, e.g., assuming a sane
ru_RU.koi8r locale:
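#include <stdio.h>
#include <stdlib.h>
#include <locale>
#include <regex>
#include <string>

// 0xB3 is CYRILLIC CAPITAL LETTER IO (U+0401) in KOI8-R.
const char data[] = "\xB3";
const char matchCyrillicCaptialLetterYo[] = R"(\u0401)";

int main(void)
{
  try {
    std::regex myRegex;
    // imbue() must precede assign() so the pattern is compiled under the new locale.
    myRegex.imbue(std::locale("ru_RU.koi8r"));
    // Attempt 1: match the character via a UnicodeEscapeSequence.
    myRegex.assign(matchCyrillicCaptialLetterYo, std::regex_constants::ECMAScript);
    printf("(%s)\n", std::regex_replace(std::string(data), myRegex, std::string("E")).c_str());
    // Attempt 2: match the same character via a locale-dependent character class.
    myRegex.assign("[[:alpha:]]", std::regex_constants::ECMAScript);
    printf("(%s)\n", std::regex_replace(std::string(data), myRegex, std::string("E")).c_str());
  } catch (const std::regex_error& e) {
    abort();
  }
  return 0;
}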
The implementation I tried prints:
(Ё)
(E)
This means that the character class matched (the byte was replaced by "E"), but the UnicodeEscapeSequence did not (the byte was left unchanged).
[2024-10-03; Jonathan comments]
std::basic_regex<charT> only properly supports
matching single code units that fit in charT.
There's nothing in the spec that supports matching code points that
require multiple code units, let alone checking whether a character
in an arbitrary encoding corresponds to any given Unicode code point.
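The code-unit limitation can be illustrated with a minimal sketch. It assumes the default "C" global locale and a UTF-8 encoded subject string. The helper whole_match and the constant yo_utf8 are hypothetical names introduced only for this illustration. With charT being char, the pattern "." consumes one byte, not one code point.

```cpp
#include <regex>
#include <string>

// U+0401 (CYRILLIC CAPITAL LETTER IO) encoded as UTF-8: two code units.
// Assumption: the subject is UTF-8; the issue's own example uses KOI8-R.
const std::string yo_utf8 = "\xD0\x81";

// Hypothetical helper: true if `pattern` matches the whole of `subject`,
// using the default ECMAScript grammar and default std::regex_traits<char>.
bool whole_match(const std::string& subject, const std::string& pattern)
{
    return std::regex_match(subject, std::regex(pattern));
}
```

Under these assumptions, whole_match(yo_utf8, ".") is expected to be false and whole_match(yo_utf8, "..") true, because each "." matches a single char.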
28.6.12 [re.grammar] paragraph 12 appears to be an attempt to
allow implementations to fail to match here, but is insufficient.
When is_unsigned_v<char> is true, the CV of the
UnicodeEscapeSequence "\u0080" is not greater than CHAR_MAX,
but that doesn't help because U+0080 is encoded as two bytes in UTF-8.
Being able to represent 0x80 as char does not mean the CV can be
matched as a single char.
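The point about U+0080 can be checked with a small sketch of a UTF-8 encoder. The function encode_utf8 is a hypothetical helper written for this illustration and is not part of the standard library or of any proposed resolution.

```cpp
#include <string>

// Hypothetical helper: encodes a Unicode scalar value as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // One code unit: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two code units: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out; // surrogates are not scalar values
        // Three code units: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x110000) {
        // Four code units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Here encode_utf8(0x80) yields the two bytes 0xC2 0x80, while encode_utf8(0x7F) yields a single byte, so a CV of 0x80 that fits in an unsigned char still cannot be matched as one char in a UTF-8 subject.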
The API is unsuitable for Unicode-aware strings.
Proposed resolution: