5 Lexical conventions [lex]

5.13 Literals [lex.literal]

5.13.5 String literals [lex.string]

string-literal:
	encoding-prefix$_{opt}$ " s-char-sequence$_{opt}$ "
	encoding-prefix$_{opt}$ R raw-string

s-char-sequence:
	s-char
	s-char-sequence s-char

s-char:
	any member of the basic source character set except the double-quote ", backslash \, or new-line character
	escape-sequence
	universal-character-name

raw-string:
	" d-char-sequence$_{opt}$ ( r-char-sequence$_{opt}$ ) d-char-sequence$_{opt}$ "

r-char-sequence:
	r-char
	r-char-sequence r-char

r-char:
	any member of the source character set, except a right parenthesis ) followed by
		the initial d-char-sequence (which may be empty) followed by a double quote ".

d-char-sequence:
	d-char
	d-char-sequence d-char

d-char:
	any member of the basic source character set except:
		space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters
		representing horizontal tab, vertical tab, form feed, and newline.

A string-literal that has an R in the prefix is a raw string literal.

The d-char-sequence serves as a delimiter.

The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence.

A d-char-sequence shall consist of at most 16 characters.

[ Note

The characters '(' and ')' are permitted in a raw-string.

Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".

— end note

]

[ Note

A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal.

Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);

— end note

]

[ Example

The raw string

R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n".

The raw string

R"(x = "\"y\"")"

is equivalent to "x = \"\\\"y\\\"\"".

— end example

]
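The equivalences stated in the notes and example above can be checked at compile time. The following is a minimal sketch, assuming a C++17 or later implementation; the variable names are hypothetical, and the raw literal lines are deliberately written without leading whitespace.

```cpp
#include <string_view>

// Delimiter "a": neither )\ nor a" ends the literal; only )a" does.
constexpr std::string_view first = R"a(
)\
a"
)a";

// Empty delimiter: backslashes and double quotes are taken literally.
constexpr std::string_view second = R"(x = "\"y\"")";

// Delimiter "delimiter" (9 characters, within the limit of 16).
constexpr std::string_view third = R"delimiter((a|b))delimiter";

static_assert(first == "\n)\\\na\"\n");
static_assert(second == "x = \"\\\"y\\\"\"");
static_assert(third == "(a|b)");
```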

After translation phase 6, a string-literal that does not begin with an encoding-prefix is an ordinary string literal.

An ordinary string literal has type “array of n const char” where n is the size of the string as defined below, has static storage duration ([basic.stc]), and is initialized with the given characters.

A string-literal that begins with u8, such as u8"asdf", is a UTF-8 string literal.

A UTF-8 string literal has type “array of n const char8_t”, where n is the size of the string as defined below; each successive element of the object representation ([basic.types]) has the value of the corresponding code unit of the UTF-8 encoding of the string.

Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.

A string-literal that begins with u, such as u"asdf", is a UTF-16 string literal.

A UTF-16 string literal has type “array of n const char16_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-16 encoding of the string.

[ Note

A single c-char may produce more than one char16_t character in the form of surrogate pairs.

A surrogate pair is a representation for a single code point as a sequence of two 16-bit code units.

— end note

]

A string-literal that begins with U, such as U"asdf", is a UTF-32 string literal.

A UTF-32 string literal has type “array of n const char32_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-32 encoding of the string.

A string-literal that begins with L, such as L"asdf", is a wide string literal.

A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it is initialized with the given characters.
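The array types described above can be observed with decltype, which yields an lvalue reference to the array type because a string-literal is an lvalue. This is a sketch assuming C++17 or later; the char8_t check is guarded by the __cpp_char8_t feature-test macro, since earlier revisions give u8 literals the element type char.

```cpp
#include <type_traits>

// "asdf" has four characters plus the terminating null character, so n is 5.
static_assert(std::is_same_v<decltype("asdf"), const char (&)[5]>);
static_assert(std::is_same_v<decltype(u"asdf"), const char16_t (&)[5]>);
static_assert(std::is_same_v<decltype(U"asdf"), const char32_t (&)[5]>);
static_assert(std::is_same_v<decltype(L"asdf"), const wchar_t (&)[5]>);

#ifdef __cpp_char8_t
// Only checked where char8_t is available.
static_assert(std::is_same_v<decltype(u8"asdf"), const char8_t (&)[5]>);
#endif
```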

In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated.

If both string-literals have the same encoding-prefix, the resulting concatenated string-literal has that encoding-prefix.

If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand.

If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed.

Any other concatenations are conditionally-supported with implementation-defined behavior.

[ Note

This concatenation is an interpretation, not a conversion.

Because the interpretation happens in translation phase 6 (after each character from a string-literal has been translated into a value from the appropriate character set), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation.

— end note

]
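The concatenation rules above can be illustrated as follows. This is a sketch assuming C++17 or later, with hypothetical variable names; the ill-formed combination (a UTF-8 string literal next to a wide string literal) is mentioned only in a comment because it would not compile.

```cpp
#include <string_view>

// Same encoding-prefix on both sides: the result keeps that prefix.
constexpr std::u16string_view same = u"a" u"b";

// One side has no encoding-prefix: it is treated as having the other's prefix.
constexpr std::u32string_view left = U"a" "b";
constexpr std::wstring_view right = "a" L"b";

// Rawness has no effect on the concatenation: the raw piece contributes
// a backslash and an 'n', the non-raw piece contributes one new-line.
constexpr std::string_view mixed = R"(\n)" "\n";

static_assert(same == u"ab");
static_assert(left == U"ab");
static_assert(right == L"ab");
static_assert(mixed == "\\n\n");

// u8"a" L"b" is ill-formed and is therefore not shown as code.
```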

Table 11 has some examples of valid concatenations.

Table 11: String literal concatenations [tab:lex.string.concat]

Source          Means     Source          Means     Source          Means
u"a" u"b"       u"ab"     U"a" U"b"       U"ab"     L"a" L"b"       L"ab"
u"a" "b"        u"ab"     U"a" "b"        U"ab"     L"a" "b"        L"ab"
"a" u"b"        u"ab"     "a" U"b"        U"ab"     "a" L"b"        L"ab"

Characters in concatenated strings are kept distinct.

[ Example

"\xA" "B"

contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').

— end example

]
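The distinction drawn in this example can be made visible through the array sizes. The sketch below assumes C++14 or later; the names kept and fused are hypothetical.

```cpp
// Two adjacent literals: '\xA' and 'B' stay separate, plus the terminator.
constexpr char kept[] = "\xA" "B";

// One literal: the hexadecimal escape consumes both digits, giving '\xAB'.
constexpr char fused[] = "\xAB";

static_assert(sizeof(kept) == 3, "two characters and a terminating null");
static_assert(sizeof(fused) == 2, "one character and a terminating null");
static_assert(kept[0] == '\xA' && kept[1] == 'B', "characters kept distinct");
```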

After any necessary concatenation, in translation phase 7 ([lex.phases]), '\0' is appended to every string-literal so that programs that scan a string can find its end.
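The appended '\0' is part of the array, so it counts toward sizeof, and it is what a scanning routine stops at. A short sketch, assuming C++17 or later; the name scanned is hypothetical.

```cpp
#include <string_view>

// Three characters plus the appended '\0'.
static_assert(sizeof("abc") == 4);
static_assert("abc"[3] == '\0');

// Even the empty string literal holds the terminator.
static_assert(sizeof("") == 1);

// An embedded '\0' is an ordinary element of the array, but code that scans
// for the end (here, the string_view constructor) stops at the first one.
constexpr std::string_view scanned = "a\0b";
static_assert(sizeof("a\0b") == 4);
static_assert(scanned.size() == 1);
```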

Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character-literals ([lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \, and except that a universal-character-name in a UTF-16 string literal may yield a surrogate pair.

In a narrow string literal, a universal-character-name may map to more than one char or char8_t element due to multibyte encoding.

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.

The size of a UTF-16 string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'.

[ Note

The size of a char16_t string literal is the number of code units, not the number of characters.

— end note

]
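The size rule for UTF-16 string literals can be checked by counting array elements. The sketch below assumes C++14 or later; code_units is a hypothetical helper, and U+1F600 is chosen only as an example of a code point outside the Basic Multilingual Plane.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in a string literal's array.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N; }

// U+00E9 fits in one UTF-16 code unit: 1 + terminator.
static_assert(code_units(u"\u00E9") == 2, "one code unit plus u'\\0'");

// U+1F600 needs a surrogate pair: 1 + 1 extra + terminator.
static_assert(code_units(u"\U0001F600") == 3, "surrogate pair plus u'\\0'");
static_assert(u"\U0001F600"[0] == 0xD83D, "high surrogate");
static_assert(u"\U0001F600"[1] == 0xDE00, "low surrogate");

// The same code point is a single element in a UTF-32 string literal.
static_assert(code_units(U"\U0001F600") == 2, "one code unit plus U'\\0'");
```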

[ Note

Any universal-character-names are required to correspond to a code point in the range [0, D800) or [E000, 10FFFF] (hexadecimal) ([lex.charset]).

— end note

]

The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
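For UTF-8 string literals the "at least one" in this rule becomes concrete, because the encoding is fixed. The sketch below assumes C++14 or later; array_size is a hypothetical helper, and the casts to unsigned char keep the comparisons valid whether the element type is char or char8_t. Ordinary string literals are not shown because their multibyte encoding depends on the implementation.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in a string literal's array.
template <class CharT, std::size_t N>
constexpr std::size_t array_size(const CharT (&)[N]) { return N; }

static_assert(array_size(u8"e") == 2, "1 byte + terminator");
static_assert(array_size(u8"\u00E9") == 3, "2 bytes + terminator");
static_assert(array_size(u8"\u20AC") == 4, "3 bytes + terminator");
static_assert(array_size(u8"\U0001F600") == 5, "4 bytes + terminator");

// U+00E9 is encoded in UTF-8 as the two code units C3 A9.
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3, "lead byte");
static_assert(static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9, "trail byte");
```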

Evaluating a string-literal results in a string literal object with static storage duration, initialized from the given characters as specified above.

Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

[ Note

The effect of attempting to modify a string-literal is undefined.

— end note

]
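Because modifying a string literal object is undefined, code that needs to change the characters typically works on a copy instead. The following is a sketch assuming C++14 or later; capitalized_first is a hypothetical function name. It also avoids comparing the addresses of two string literals, since whether they refer to the same object is unspecified.

```cpp
// Hypothetical helper: modifies an array initialized from a string literal.
constexpr char capitalized_first() {
    char copy[] = "hello";  // a distinct array, initialized from the literal
    copy[0] = 'H';          // modifies the copy, not the string literal object
    return copy[0];
}

static_assert(capitalized_first() == 'H', "the copy is modifiable");

// The literal itself is unchanged by the function above.
static_assert("hello"[0] == 'h', "string literal characters are unchanged");
```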