5 Lexical conventions [lex]
integer-literal:
binary-literal integer-suffixopt
octal-literal integer-suffixopt
decimal-literal integer-suffixopt
hexadecimal-literal integer-suffixopt
binary-literal:
0b binary-digit
0B binary-digit
binary-literal 'opt binary-digit
octal-literal:
0
octal-literal 'opt octal-digit
decimal-literal:
nonzero-digit
decimal-literal 'opt digit
hexadecimal-literal:
hexadecimal-prefix hexadecimal-digit-sequence
binary-digit: one of
0 1
octal-digit: one of
0 1 2 3 4 5 6 7
nonzero-digit: one of
1 2 3 4 5 6 7 8 9
hexadecimal-prefix: one of
0x 0X
hexadecimal-digit-sequence:
hexadecimal-digit
hexadecimal-digit-sequence 'opt hexadecimal-digit
hexadecimal-digit: one of
0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F
integer-suffix:
unsigned-suffix long-suffixopt
unsigned-suffix long-long-suffixopt
long-suffix unsigned-suffixopt
long-long-suffix unsigned-suffixopt
unsigned-suffix: one of
u U
long-suffix: one of
l L
long-long-suffix: one of
ll LL
[ Note: The prefix and any optional separating single quotes are ignored when determining the value. — end note ]
Table 7: Base of integer-literals [tab:lex.icon.base]
Kind of integer-literal | base N |
binary-literal | 2 |
octal-literal | 8 |
decimal-literal | 10 |
hexadecimal-literal | 16 |
The hexadecimal-digits a through f and A through F have decimal values ten through fifteen. [ Example: The number twelve can be written 12, 014, 0XC, or 0b1100. The integer-literals 1048576, 1'048'576, 0X100000, 0x10'0000, and 0'004'000'000 all have the same value. — end example ]
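The example above can be checked at compile time. The following is a minimal sketch, assuming a C++17 compiler; the array name `twelve_spellings` is introduced here for illustration only.

```cpp
// Four spellings of twelve: decimal, octal, hexadecimal, binary.
constexpr int twelve_spellings[] = {12, 014, 0XC, 0b1100};

static_assert(twelve_spellings[0] == twelve_spellings[1]);
static_assert(twelve_spellings[1] == twelve_spellings[2]);
static_assert(twelve_spellings[2] == twelve_spellings[3]);

// Prefixes and digit separators do not contribute to the value.
static_assert(1048576 == 1'048'576);
static_assert(1048576 == 0X100000);
static_assert(1048576 == 0x10'0000);
static_assert(1048576 == 0'004'000'000);  // leading 0 makes this octal
```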
The type of an integer-literal is the first type in the list in Table 8 corresponding to its optional integer-suffix in which its value can be represented.
Table 8: Types of integer-literals [tab:lex.icon.type]
integer-suffix | decimal-literal | integer-literal other than decimal-literal |
none | int | int |
 | long int | unsigned int |
 | long long int | long int |
 | | unsigned long int |
 | | long long int |
 | | unsigned long long int |
u or U | unsigned int | unsigned int |
 | unsigned long int | unsigned long int |
 | unsigned long long int | unsigned long long int |
l or L | long int | long int |
 | long long int | unsigned long int |
 | | long long int |
 | | unsigned long long int |
Both u or U and l or L | unsigned long int | unsigned long int |
 | unsigned long long int | unsigned long long int |
ll or LL | long long int | long long int |
 | | unsigned long long int |
Both u or U and ll or LL | unsigned long long int | unsigned long long int |
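The selection rule in Table 8 can be observed with `decltype`. This is a sketch assuming C++17. The flag `int_is_32_bits` is a name introduced here, and the checks that use it only apply on implementations where `int` and `unsigned int` have 32 value-carrying bits, which is an assumption about the platform rather than something the text requires.

```cpp
#include <limits>
#include <type_traits>

// Small values take the first type listed for their suffix.
static_assert(std::is_same_v<decltype(1), int>);
static_assert(std::is_same_v<decltype(1u), unsigned int>);
static_assert(std::is_same_v<decltype(1L), long int>);
static_assert(std::is_same_v<decltype(1uL), unsigned long int>);
static_assert(std::is_same_v<decltype(1LL), long long int>);
static_assert(std::is_same_v<decltype(1uLL), unsigned long long int>);

// Assumption (platform-dependent): 32-bit int and unsigned int.
constexpr bool int_is_32_bits =
    std::numeric_limits<int>::max() == 2147483647 &&
    std::numeric_limits<unsigned int>::max() == 4294967295u;

// A hexadecimal literal too large for int may become unsigned int.
static_assert(!int_is_32_bits ||
              std::is_same_v<decltype(0xFFFFFFFF), unsigned int>);

// The list for an unsuffixed decimal literal holds only signed types.
static_assert(std::is_signed_v<decltype(4294967295)>);
```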
If an integer-literal cannot be represented by any type in its list and an extended integer type ([basic.fundamental]) can represent its value, it may have that extended integer type. If all of the types in the list for the integer-literal are signed, the extended integer type shall be signed. If all of the types in the list for the integer-literal are unsigned, the extended integer type shall be unsigned. If the list contains both signed and unsigned types, the extended integer type may be signed or unsigned. A program is ill-formed if one of its translation units contains an integer-literal that cannot be represented by any of the allowed types.
character-literal:
encoding-prefixopt ' c-char-sequence '
encoding-prefix: one of
u8 u U L
c-char-sequence:
c-char
c-char-sequence c-char
c-char:
any member of the basic source character set except the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name
escape-sequence:
simple-escape-sequence
octal-escape-sequence
hexadecimal-escape-sequence
simple-escape-sequence: one of
\' \" \? \\
\a \b \f \n \r \t \v
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
hexadecimal-escape-sequence:
\x hexadecimal-digit
hexadecimal-escape-sequence hexadecimal-digit
An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
The value of a UTF-8 character literal is equal to its ISO/IEC 10646 code point value, provided that the code point value can be encoded as a single UTF-8 code unit. [ Note: That is, provided the code point value is in the range [0,7F] (hexadecimal). — end note ] If the value is not representable with a single UTF-8 code unit, the program is ill-formed. A UTF-8 character literal containing multiple c-chars is ill-formed.
The value of a UTF-16 character literal is equal to its ISO/IEC 10646 code point value, provided that the code point value is representable with a single 16-bit code unit. [ Note: That is, provided the code point value is in the range [0,FFFF] (hexadecimal). — end note ] If the value is not representable with a single 16-bit code unit, the program is ill-formed. A UTF-16 character literal containing multiple c-chars is ill-formed.
The value of a UTF-32 character literal containing a single c-char is equal to its ISO/IEC 10646 code point value. A UTF-32 character literal containing multiple c-chars is ill-formed.
A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined. [ Note: The type wchar_t is able to represent all members of the execution wide-character set (see [basic.fundamental]). — end note ]
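The character-literal rules above can be spot-checked as follows. This is a sketch assuming C++17 or later. The types `char16_t` and `char32_t` for the `u` and `U` prefixes are taken from the language itself, since this excerpt does not state them. The UTF-8 check compares values only, so it does not depend on whether the literal has type `char` or `char8_t`.

```cpp
#include <type_traits>

// Types of single-character literals.
static_assert(std::is_same_v<decltype('a'), char>);
static_assert(std::is_same_v<decltype(u'a'), char16_t>);
static_assert(std::is_same_v<decltype(U'a'), char32_t>);
static_assert(std::is_same_v<decltype(L'a'), wchar_t>);

// UTF-8, UTF-16 and UTF-32 character literals carry the code point value.
static_assert(u8'A' == 0x41);             // fits in one UTF-8 code unit
static_assert(u'\u00E9' == 0x00E9);       // fits in one UTF-16 code unit
static_assert(U'\U0001F600' == 0x1F600);  // any code point fits in UTF-32

// u8'\u00E9' and u'\U0001F600' would be ill-formed: each needs more than
// one code unit in its encoding.
```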
The value of a wide-character literal containing multiple c-chars is implementation-defined.
Certain non-graphic characters, the single quote ', the double quote ", the question mark ?, and the backslash \, can be represented according to Table 9. The double quote " and the question mark ?, can be represented as themselves or by the escape sequences \" and \? respectively, but the single quote ' and the backslash \ shall be represented by the escape sequences \' and \\ respectively. Escape sequences in which the character following the backslash is not listed in Table 9 are conditionally-supported, with implementation-defined semantics. An escape sequence specifies a single character.
Table 9: Escape sequences [tab:lex.ccon.esc]
new-line | NL(LF) | \n |
horizontal tab | HT | \t |
vertical tab | VT | \v |
backspace | BS | \b |
carriage return | CR | \r |
form feed | FF | \f |
alert | BEL | \a |
backslash | \ | \\ |
question mark | ? | \? |
single quote | ' | \' |
double quote | " | \" |
octal number | ooo | \ooo |
hex number | hhh | \xhhh |
The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. [ Note: If the value of a character-literal prefixed by u, u8, or U is outside the range defined for its type, the program is ill-formed. — end note ]
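A few compile-time checks illustrate how numeric escapes are delimited. This is a sketch assuming C++17; the array name `octal_then_digit` is introduced here for illustration.

```cpp
// Octal and hexadecimal escapes both denote a numeric value.
static_assert('\101' == 65);
static_assert('\x41' == 65);
static_assert('\x00041' == 65);  // a hexadecimal escape has no digit limit

// An octal escape stops after three digits, so "\1011" holds the
// character with value 65, then the digit '1', then the terminator.
constexpr char octal_then_digit[] = "\1011";
static_assert(sizeof(octal_then_digit) == 3);
static_assert(octal_then_digit[0] == 65);
static_assert(octal_then_digit[1] == '1');

// Each simple escape sequence specifies a single character.
static_assert(sizeof("\'\"\?\\") == 5);  // four characters plus '\0'
```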
A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. [ Note: In translation phase 1, a universal-character-name is introduced whenever an actual extended character is encountered in the source text. However, the actual compiler implementation may use its own native character set, so long as the same results are obtained. — end note ]
string-literal:
encoding-prefixopt " s-char-sequenceopt "
encoding-prefixopt R raw-string
s-char-sequence:
s-char
s-char-sequence s-char
s-char:
any member of the basic source character set except the double-quote ", backslash \, or new-line character
escape-sequence
universal-character-name
raw-string:
" d-char-sequenceopt ( r-char-sequenceopt ) d-char-sequenceopt "
r-char-sequence:
r-char
r-char-sequence r-char
r-char:
any member of the source character set, except a right parenthesis ) followed by
the initial d-char-sequence (which may be empty) followed by a double quote ".
d-char-sequence:
d-char
d-char-sequence d-char
d-char:
any member of the basic source character set except:
space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters
representing horizontal tab, vertical tab, form feed, and newline.
[ Note: The characters '(' and ')' are permitted in a raw-string. Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)". — end note ]
[ Note: A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal. Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
— end note ]
[ Example: The raw string
R"a(
)\
a"
)a"
is equivalent to "\n)\\\na\"\n". The raw string
R"(x = "\"y\"")"
is equivalent to "x = \"\\\"y\\\"\"". — end example ]
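The equivalences stated in the notes and example above can be verified with constant expressions. This is a sketch assuming C++17 (for `std::string_view`); the variable names are introduced here. The lines inside the multi-line raw string must begin in the first column, because any leading whitespace would become part of the string.

```cpp
#include <string_view>

// Parentheses are ordinary characters inside the delimited body.
constexpr std::string_view raw_delim = R"delimiter((a|b))delimiter";
static_assert(raw_delim == "(a|b)");

// Backslashes and double quotes need no escaping in a raw string.
constexpr std::string_view raw_quotes = R"(x = "\"y\"")";
static_assert(raw_quotes == "x = \"\\\"y\\\"\"");

// A source new-line inside a raw string yields a new-line character,
// and the backslash before it is kept rather than spliced away.
constexpr std::string_view raw_newline = R"(a\
b
c)";
static_assert(raw_newline == "a\\\nb\nc");
```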
An ordinary string literal has type “array of n const char” where n is the size of the string as defined below, has static storage duration ([basic.stc]), and is initialized with the given characters.
A UTF-8 string literal has type “array of n const char8_t”, where n is the size of the string as defined below; each successive element of the object representation ([basic.types]) has the value of the corresponding code unit of the UTF-8 encoding of the string.
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.
A UTF-16 string literal has type “array of n const char16_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-16 encoding of the string. [ Note: A single c-char may produce more than one char16_t character in the form of surrogate pairs. A surrogate pair is a representation for a single code point as a sequence of two 16-bit code units. — end note ]
A UTF-32 string literal has type “array of n const char32_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-32 encoding of the string.
A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it is initialized with the given characters.
If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior. [ Note: This concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a string-literal has been translated into a value from the appropriate character set), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. — end note ]
Table 11 has some examples of valid concatenations.
Table 11: String literal concatenations [tab:lex.string.concat]
Source | Means | Source | Means | Source | Means |
u"a" u"b" | u"ab" | U"a" U"b" | U"ab" | L"a" L"b" | L"ab" |
u"a" "b" | u"ab" | U"a" "b" | U"ab" | L"a" "b" | L"ab" |
"a" u"b" | u"ab" | "a" U"b" | U"ab" | "a" L"b" | L"ab" |
Characters in concatenated strings are kept distinct. [ Example: "\xA" "B" contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB'). — end example ]
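The example above, together with two rows of Table 11, can be checked as follows. This is a sketch assuming C++17; the array name `kept_distinct` is introduced here.

```cpp
#include <type_traits>

// The escape sequence ends at the closing quote of the first literal, so
// the result holds '\xA', 'B', and the terminating '\0'.
constexpr char kept_distinct[] = "\xA" "B";
static_assert(sizeof(kept_distinct) == 3);
static_assert(kept_distinct[0] == '\xA');
static_assert(kept_distinct[1] == 'B');

// An encoding-prefix on one literal applies to the concatenated result.
static_assert(std::is_same_v<decltype(u"a" "b"), const char16_t (&)[3]>);
static_assert(std::is_same_v<decltype("a" L"b"), const wchar_t (&)[3]>);
```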
After any necessary concatenation, in translation phase 7 ([lex.phases]), '\0' is appended to every string-literal so that programs that scan a string can find its end.
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'. The size of a UTF-16 string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'. [ Note: The size of a char16_t string literal is the number of code units, not the number of characters. — end note ]
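The size rules for UTF-16, UTF-32, and wide string literals can be illustrated with a small helper. This is a sketch assuming C++17; `literal_size` is a hypothetical helper introduced here that returns the array bound n of a string literal.

```cpp
#include <cstddef>

// Hypothetical helper: deduces the array bound of a string literal.
template <class CharT, std::size_t N>
constexpr std::size_t literal_size(const CharT (&)[N]) {
    return N;
}

// UTF-32: one element per character, plus the terminator.
static_assert(literal_size(U"\U0001F600") == 2);

// UTF-16: U+1F600 needs a surrogate pair, so it counts twice.
static_assert(literal_size(u"\U0001F600") == 3);
static_assert(literal_size(u"\u00E9") == 2);  // no surrogate pair needed

// Wide: two characters plus L'\0'.
static_assert(literal_size(L"ab") == 3);
```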
The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
Evaluating a string-literal results in a string literal object with static storage duration, initialized from the given characters as specified above. Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.
boolean-literal:
false
true
The Boolean literals are the keywords false and true. Such literals are prvalues and have type bool.
pointer-literal:
nullptr
The pointer literal is the keyword nullptr. It is a prvalue of type std::nullptr_t. [ Note: std::nullptr_t is a distinct type that is neither a pointer type nor a pointer-to-member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value. — end note ]
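The properties described in the note can be confirmed with type traits. This is a sketch assuming C++17; the struct `Widget` and the two pointer variables are hypothetical names introduced here.

```cpp
#include <cstddef>
#include <type_traits>

static_assert(std::is_same_v<decltype(true), bool>);
static_assert(std::is_same_v<decltype(nullptr), std::nullptr_t>);

// std::nullptr_t is neither a pointer nor a pointer-to-member type.
static_assert(!std::is_pointer_v<std::nullptr_t>);
static_assert(!std::is_member_pointer_v<std::nullptr_t>);

// Hypothetical class, used only to form a pointer-to-member type.
struct Widget {
    int value;
};

// nullptr converts to a null pointer value and a null member pointer value.
constexpr int* null_object_pointer = nullptr;
constexpr int Widget::* null_member_pointer = nullptr;
static_assert(null_object_pointer == nullptr);
static_assert(null_member_pointer == nullptr);
```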
user-defined-literal:
user-defined-integer-literal
user-defined-floating-point-literal
user-defined-string-literal
user-defined-character-literal
user-defined-integer-literal:
decimal-literal ud-suffix
octal-literal ud-suffix
hexadecimal-literal ud-suffix
binary-literal ud-suffix
user-defined-floating-point-literal:
fractional-constant exponent-partopt ud-suffix
digit-sequence exponent-part ud-suffix
hexadecimal-prefix hexadecimal-fractional-constant binary-exponent-part ud-suffix
hexadecimal-prefix hexadecimal-digit-sequence binary-exponent-part ud-suffix
user-defined-string-literal:
string-literal ud-suffix
user-defined-character-literal:
character-literal ud-suffix
ud-suffix:
identifier
The syntactic non-terminal preceding the ud-suffix in a user-defined-literal is taken to be the longest sequence of characters that could match that non-terminal.
A user-defined-literal is treated as a call to a literal operator or literal operator template ([over.literal]). To determine the form of this call for a given user-defined-literal L with ud-suffix X, the literal-operator-id whose literal suffix identifier is X is looked up in the context of L using the rules for unqualified name lookup. Let S be the set of declarations found by this lookup. S shall not be empty.
If L is a user-defined-integer-literal, let n be the literal without its ud-suffix. If S contains a literal operator with parameter type unsigned long long, the literal L is treated as a call of the form
  operator "" X(nULL)
Otherwise, S shall contain a raw literal operator or a numeric literal operator template ([over.literal]) but not both. If S contains a raw literal operator, the literal L is treated as a call of the form
  operator "" X("n")
Otherwise (S contains a numeric literal operator template), L is treated as a call of the form
  operator "" X<'c1', 'c2', ... 'ck'>()
where n is the source character sequence c1c2...ck. [ Note: The sequence c1c2...ck can only contain characters from the basic source character set. — end note ]
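The three call forms for an integer literal can be sketched with one suffix per form. This assumes C++17; the suffixes `_cooked`, `_rawlen`, and `_count` are hypothetical names introduced here, and each is declared in only one form so that the "but not both" requirement is met.

```cpp
#include <cstddef>

// Cooked form: receives the value n as unsigned long long.
constexpr unsigned long long operator""_cooked(unsigned long long n) {
    return n * 2;
}

// Raw form: receives the spelling of n as a null-terminated string.
constexpr std::size_t operator""_rawlen(const char* spelling) {
    std::size_t length = 0;
    while (spelling[length] != '\0') ++length;
    return length;
}

// Numeric literal operator template: receives the spelling as a pack.
template <char... Cs>
constexpr std::size_t operator""_count() {
    return sizeof...(Cs);
}

static_assert(21_cooked == 42);   // operator""_cooked(21ULL)
static_assert(0x10_rawlen == 4);  // operator""_rawlen("0x10")
static_assert(123_count == 3);    // operator""_count<'1', '2', '3'>()
```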
If L is a user-defined-floating-point-literal, let f be the literal without its ud-suffix. If S contains a literal operator with parameter type long double, the literal L is treated as a call of the form
  operator "" X(fL)
Otherwise, S shall contain a raw literal operator or a numeric literal operator template ([over.literal]) but not both. If S contains a raw literal operator, the literal L is treated as a call of the form
  operator "" X("f")
Otherwise (S contains a numeric literal operator template), L is treated as a call of the form
  operator "" X<'c1', 'c2', ... 'ck'>()
where f is the source character sequence c1c2...ck. [ Note: The sequence c1c2...ck can only contain characters from the basic source character set. — end note ]
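The floating-point case follows the same pattern, with `long double` as the cooked parameter type. This is a sketch assuming C++17; the suffixes `_half` and `_spelled` are hypothetical names introduced here.

```cpp
#include <cstddef>

// Cooked form: receives the value f as long double.
constexpr long double operator""_half(long double f) {
    return f / 2;
}

// Raw form: receives the spelling of f, including any exponent part.
constexpr std::size_t operator""_spelled(const char* spelling) {
    std::size_t length = 0;
    while (spelling[length] != '\0') ++length;
    return length;
}

static_assert(1.5_half == 0.75L);    // operator""_half(1.5L)
static_assert(1.5e3_spelled == 5);   // operator""_spelled("1.5e3")
```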
If L is a user-defined-string-literal, let str be the literal without its ud-suffix and let len be the number of code units in str (i.e., its length excluding the terminating null character). If S contains a literal operator template with a non-type template parameter for which str is a well-formed template-argument, the literal L is treated as a call of the form
  operator "" X<str>()
Otherwise, the literal L is treated as a call of the form
  operator "" X(str, len)
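The (str, len) call form can be sketched with two overloads of one suffix. This assumes C++17; the suffix `_len` is a hypothetical name introduced here, and the template form is not shown.

```cpp
#include <cstddef>

// Overloads selected by the character type of the string literal.
constexpr std::size_t operator""_len(const char*, std::size_t len) {
    return len;
}
constexpr std::size_t operator""_len(const char16_t*, std::size_t len) {
    return len;
}

static_assert("two"_len == 3);   // operator""_len("two", 3)
static_assert(u"one"_len == 3);  // operator""_len(u"one", 3)
static_assert(""_len == 0);      // the terminating null is not counted

// len counts code units: U+1F600 is a surrogate pair in UTF-16.
static_assert(u"\U0001F600"_len == 2);
```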
If L is a user-defined-character-literal, let ch be the literal without its ud-suffix. S shall contain a literal operator whose only parameter has the type of ch and the literal L is treated as a call of the form
  operator "" X(ch)
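[ Example:
long double operator "" _w(long double);
std::string operator "" _w(const char16_t*, std::size_t);
unsigned operator "" _w(const char*);
int main() {
  1.2_w;      // calls operator "" _w(1.2L)
  u"one"_w;   // calls operator "" _w(u"one", 3)
  12_w;       // calls operator "" _w("12")
  "two"_w;    // error: no applicable literal operator
}
— end example ]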
During concatenation, ud-suffixes are removed and ignored and the concatenation process occurs as described in [lex.string]. [ Example:
int main() {
  L"A" "B" "C"_x;   // OK: same as L"ABC"_x
  "P"_x "Q" "R"_y;  // error: two different ud-suffixes
}
— end example ]
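The well-formed line of the example above can be made checkable by giving `_x` a definition. This is a sketch assuming C++17; the two `operator""_x` overloads below are hypothetical and simply return len.

```cpp
#include <cstddef>

// Hypothetical definitions of the suffix _x for wide and ordinary strings.
constexpr std::size_t operator""_x(const wchar_t*, std::size_t len) {
    return len;
}
constexpr std::size_t operator""_x(const char*, std::size_t len) {
    return len;
}

// Concatenation happens first, so the operator sees L"ABC".
static_assert(L"A" "B" "C"_x == 3);

// The suffix may appear on any of the adjacent literals.
static_assert("P"_x "Q" == 2);

// "P"_x "Q" "R"_y would be ill-formed even if _y were declared, because
// the two ud-suffixes differ.
```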