2 Lexical conventions [lex]

2.14 Literals [lex.literal]

2.14.5 String literals [lex.string]

string-literal:
    encoding-prefixopt " s-char-sequenceopt "
    encoding-prefixopt R raw-string
encoding-prefix:
  u8
  u
  U
  L
s-char-sequence:
    s-char
    s-char-sequence s-char
s-char:
	any member of the source character set except
		the double-quote ", backslash \, or new-line character
	escape-sequence
	universal-character-name
raw-string:
    " d-char-sequenceopt ( r-char-sequenceopt ) d-char-sequenceopt "
r-char-sequence:
    r-char
    r-char-sequence r-char
r-char:
	any member of the source character set, except
		a right parenthesis ) followed by the initial  d-char-sequence
		(which may be empty) followed by a double quote ".
d-char-sequence:
    d-char
    d-char-sequence d-char
d-char:
	any member of the basic source character set except:
		space, the left parenthesis (, the right parenthesis ), the backslash \,
		and the control characters representing horizontal tab,
		vertical tab, form feed, and newline.

A string literal is a sequence of characters (as defined in [lex.ccon]) surrounded by double quotes, optionally prefixed by R, u8, u8R, u, uR, U, UR, L, or LR, as in "...", R"(...)", u8"...", u8R"**(...)**", u"...", uR"*~(...)*~", U"...", UR"zzz(...)zzz", L"...", or LR"(...)", respectively.

A string literal that has an R in the prefix is a raw string literal. The d-char-sequence serves as a delimiter. The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence. A d-char-sequence shall consist of at most 16 characters.

Note: The characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".  — end note ]

Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string-literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);

 — end note ]

Example: The raw string

R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n". The raw string

R"(??)"

is equivalent to "\?\?". The raw string

R"#(
)??="
)#"

is equivalent to "\n)\?\?=\"\n".  — end example ]

After translation phase 6, a string literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.

A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal.

Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration ([basic.stc]).

For a UTF-8 string literal, each successive element of the object representation ([basic.types]) has the value of the corresponding code unit of the UTF-8 encoding of the string.

A string literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.

A string literal that begins with U, such as U"asdf", is a char32_t string literal. A char32_t string literal has type “array of n const char32_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.

A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.

Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined.

In translation phase 6 ([lex.phases]), adjacent string literals are concatenated. If both string literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix. If one string literal has no encoding-prefix, it is treated as a string literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior. [ Note: This concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a literal has been translated into a value from the appropriate character set), a string literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation.  — end note ] Table [tab:lex.string.concat] has some examples of valid concatenations.

Table 8 — String literal concatenations
Source Means Source Means Source Means
u"a" u"b" u"ab" U"a" U"b" U"ab" L"a" L"b" L"ab"
u"a" "b" u"ab" U"a" "b" U"ab" L"a" "b" L"ab"
"a" u"b" u"ab" "a" U"b" U"ab" "a" L"b" L"ab"

Characters in concatenated strings are kept distinct.

Example:

"\xA" "B"

contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').  — end example ]

After any necessary concatenation, in translation phase 7 ([lex.phases]), '\0' is appended to every string literal so that programs that scan a string can find its end.

Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals ([lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \. In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding. The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'. The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'. [ Note: The size of a char16_t string literal is the number of code units, not the number of characters.  — end note ] Within char32_t and char16_t literals, any universal-character-names shall be within the range 0x0 to 0x10FFFF. The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.