An implementation shall support input files that are a sequence of UTF-8 code units (UTF-8 files).
It may also support an implementation-defined set of other kinds of input files, and, if so, the kind of an input file is determined in an implementation-defined manner that includes a means of designating input files as UTF-8 files, independent of their content.
If an input file is determined to be a UTF-8 file, then it shall be a well-formed UTF-8 code unit sequence and it is decoded to produce a sequence of Unicode scalar values.
A sequence of translation character set elements is then formed by mapping each Unicode scalar value to the corresponding translation character set element.
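The decoding step above can be sketched as follows. This is only an illustration, not how any particular implementation works: the function name `decode_utf8` and its bool-plus-out-parameter interface are hypothetical. It rejects ill-formed input (invalid lead bytes, truncated sequences, overlong encodings, surrogates, and values above U+10FFFF) and otherwise yields one `char32_t` per Unicode scalar value.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes a UTF-8 code unit sequence into Unicode
// scalar values. Returns false if the input is not well-formed UTF-8;
// in that case `out` holds only the values decoded before the error.
bool decode_utf8(const std::string& bytes, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;      // value being assembled
        std::size_t len = 0;  // total length of this encoded sequence
        char32_t min = 0;     // smallest value that needs `len` bytes
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (i + len > bytes.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                       // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate code point
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

Under these assumptions, the bytes `41 C3 A9` decode to U+0041 followed by U+00E9, while `C0 80` (an overlong encoding of U+0000) is rejected.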
In the resulting sequence, each pair of characters in the input sequence consisting of U+000d carriage return followed by U+000a line feed, as well as each U+000d carriage return not immediately followed by a U+000a line feed, is replaced by a single new-line character.
If the first translation character is U+feff byte order mark, it is deleted.
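A minimal sketch of the two replacements just described, assuming the input has already been decoded to scalar values. The function name `normalize_newlines_and_bom` is hypothetical and is not taken from any implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces CR LF pairs and lone CRs with a single
// new-line, then deletes a leading U+FEFF byte order mark if present.
std::u32string normalize_newlines_and_bom(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == U'\r') {
            // Skip the LF of a CR LF pair so the pair yields one new-line.
            if (i + 1 < in.size() && in[i + 1] == U'\n') ++i;
            out.push_back(U'\n');
        } else {
            out.push_back(in[i]);
        }
    }
    // Only the first character is checked; a later U+FEFF is kept.
    if (!out.empty() && out.front() == char32_t(0xFEFF)) out.erase(0, 1);
    return out;
}
```

For example, the sequence `a CR LF b CR c LF` becomes `a LF b LF c LF`.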
Each sequence of a backslash character (\) immediately followed by zero or more whitespace characters other than new-line followed by a new-line character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part of such a splice.
A source file that is not empty and that does not end in a new-line character, or that ends in a splice, shall be processed as if an additional new-line character were appended to the file.
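The splicing rules above can be sketched as follows. This is an illustration under simplifying assumptions: the input is a narrow string in which new-line is `'\n'`, "whitespace other than new-line" is taken to be space, horizontal tab, vertical tab, and form feed, and the names `is_horizontal_space` and `splice_lines` are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for whitespace characters other than new-line.
bool is_horizontal_space(char c) {
    return c == ' ' || c == '\t' || c == '\v' || c == '\f';
}

// Hypothetical helper: deletes each backslash / optional whitespace /
// new-line sequence, then appends a new-line if the file is nonempty and
// either lacks a trailing new-line or ended in a splice.
std::string splice_lines(const std::string& in) {
    std::string out;
    bool ended_in_splice = false;
    std::size_t i = 0;
    while (i < in.size()) {
        ended_in_splice = false;
        if (in[i] == '\\') {
            // Look past trailing whitespace for the new-line. A backslash
            // that matches here is necessarily the last one on its line.
            std::size_t j = i + 1;
            while (j < in.size() && is_horizontal_space(in[j])) ++j;
            if (j < in.size() && in[j] == '\n') {
                i = j + 1;                         // delete the whole splice
                ended_in_splice = (i == in.size());
                continue;
            }
        }
        out.push_back(in[i++]);
    }
    if (!in.empty() && (ended_in_splice || out.empty() || out.back() != '\n'))
        out.push_back('\n');
    return out;
}
```

With this sketch, a backslash followed by other characters on the same line is left alone, and a file consisting of `x`, a backslash, and a new-line yields `x` followed by a new-line.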
The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of whitespace characters (including comments).
Each comment is replaced by one space character.
New-line characters are retained.
Whether each nonempty sequence of whitespace characters other than new-line is retained or replaced by one space character is unspecified.
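As a small illustration of comment replacement, the fragment below depends on the comment being treated as whitespace: without it, `int` and `answer` would run together into one identifier. The name `answer` is hypothetical.

```cpp
// The comment between `int` and `answer` is replaced by one space, so the
// declaration reads as: constexpr int answer() { return 42; }
constexpr int/* separates the two tokens */answer() { return 42; }

static_assert(answer() == 42, "the comment acted as a token separator");
```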
As characters from the source file are consumed to form the next preprocessing token (i.e., not being consumed as part of a comment or other forms of whitespace), except when matching a c-char-sequence, s-char-sequence, r-char-sequence, h-char-sequence, or q-char-sequence, universal-character-names are recognized and replaced by the designated element of the translation character set.
The process of dividing a source file's characters into preprocessing tokens is context-dependent.
Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed.
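A brief sketch of directive execution and macro expansion; the names `SQUARE` and `nine` are hypothetical. The conditional directive selects one declaration, and the macro invocation is expanded before the resulting tokens are analyzed.

```cpp
// Function-like macro: parenthesized so the argument expands safely.
#define SQUARE(x) ((x) * (x))

#if defined(SQUARE)
// SQUARE(1 + 2) expands to ((1 + 2) * (1 + 2)).
constexpr int nine = SQUARE(1 + 2);
#else
// Not selected here, because SQUARE is defined above.
constexpr int nine = 0;
#endif

static_assert(nine == 9, "macro expanded with parenthesized argument");
```

The parentheses in the macro body matter: an unparenthesized `x * x` would expand `SQUARE(1 + 2)` to `1 + 2 * 1 + 2`, which evaluates to 5.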
A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively.
All preprocessing directives are then deleted.
Whitespace characters separating tokens are no longer significant.
The resulting tokens constitute a translation unit and are syntactically and semantically analyzed and translated.
It is implementation-defined whether the sources for module units and header units on which the current translation unit has an interface dependency ([module.unit], [module.import]) are required to be available.
[Note: Source files, translation units, and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation.
The description is conceptual only, and does not specify any particular implementation. — end note]
Translated translation units and instantiation units are combined as follows:
The definitions of the required templates are located.
It is implementation-defined whether the source of the translation units containing these definitions is required to be available.
The program is ill-formed if any instantiation fails.