Specification
THIS SPECIFICATION IS UNDER DEVELOPMENT
Notation
The syntax is specified using Extended Backus-Naur Form (EBNF):
production ::= PRODUCTION_NAME '::=' expression?
expression ::= alternative ("|" alternative)*
alternative ::= term term*
term ::= PRODUCTION_NAME | TOKEN | set | group | option | repetition
set ::= '[' (range | CHAR) (rang | CHAR)* ']'
range ::= CHAR '-' CHAR
group ::= '(' expression ')'
option ::= expression '?'
repetition ::= expression '*'
Productions are expressions constructed from terms and the following operators, in increasing precedence:
| alternation
() grouping
? option (0 or 1 times)
* repetition (0 to n times)
Uppercase production names are used to identify lexical tokens. Non-terminals are in lower case. Lexical tokens are enclosed in single quotes ''.
The form a..b represents the set of characters from a through b as alternatives.
Source code representation
A program consists of one or more translation units stored in files written in the Unicode character set, stored as a sequence of bytes using the UTF-8 encoding. Except for comments and the contents of character and string literals, all input elements are formed only from the ASCII subset (U+0000 to U+007F) of Unicode.
A raw byte stream is translated into a sequence of tokens which white space and non-doc comments are discarded. Doc comments may optionally be discarded as well. The resulting input elements form the tokens that are the terminal symbols of the syntactic grammar.
Lexical Translations
A raw byte stream is translated into a sequence of tokens which white space and non-doc comments are discarded. Doc comments may optionally be discarded as well. The resulting input elements form the tokens that are the terminal symbols of the syntactic grammar.
The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would.
Example:
a--b
is translated asa
,--
,b
, which does not form a grammatically correct expression, even >though the tokenizationa
,-
,-
,b
could form a grammatically correct expression.
Line Terminators
The C3 compiler divides the sequence of input bytes into lines by recognizing line terminators
Lines are terminated by the ASCII LF character (U+000A), also known as "newline". A line termination specifies the termination of the // form of a comment.
Input Elements and Tokens
An input element may be:
- White space
- Comment
- Doc Comment
- Token
A token may be:
- Identifier
- Keyword
- Literal
- Separator
- Operator
A Doc Comment consists of:
- A stream of descriptive text
- A list of directive Tokens
Those input elements that are not white space or comments are tokens. The tokens are the terminal symbols of the syntactic grammar. Whitespace and comments can serve to separate tokens that might be tokenized in another manner. For example the characters +
and =
may form the operator token +=
only if there is no intervening white space or comment.
White Space
White space is defined as the ASCII horizontal tab character (U+0009), form feed character (U+000A), vertical tab (U+000B), carriage return (U+000D), space character (U+0020) and the line terminator character (U+000D).
WHITESPACE ::= [ \t\f\v\r\n]
Letters and digits
UC_LETTER ::= [A-Z]
LC_LETTER ::= [a-z]
LETTER ::= UC_LETTER | LC_LETTER
DIGIT ::= [0-9]
HEX_DIGIT ::= [0-9a-fA-F]
BINARY_DIGIT ::= [01]
OCTAL_DIGIT ::= [0-7]
UC_LETTER_US ::= UC_LETTER | "_"
ALPHANUM ::= LETTER | DIGIT
ALPHANUM_US ::= ALPHANUM | "_"
UC_ALPHANUM_US ::= UC_LETTER_US | DIGIT
Comments
Thre are three types of regular comments:
// text
a line comment. The text between//
and line end is ignored./* text */
block comments. The text between/*
and*/
is ignored. It has nesting behaviour, so for every/*
discovered between the first/*
and the last*/
a corresponding*/
must be found.
Doc comments
/** text **/
doc block comment. The text between/**
and**/
is optionally parsed using the doc comment syntatic grammar. A compiler may choose to read/** text **/
as a regular comment.///
text doc line comment. The text between///
and line end is optionally parsed using the doc comment syntactic grammar. A compiler may choose to read///
as a regular comment.
Identifiers
Identifiers name program entities such as variables and types. An identifier is a sequence of one or more letters and digits. The first character in an identifier must be a letter or underscore.
C3 has three types of identifiers: const identifiers - containing only underscore and upper-case letters, type identifiers - starting with an upper case letter followed by at least one underscore letter and regular identifiers, starting with a lower case letter.
IDENTIFIER ::= "_"* LC_LETTER ALPHANUM_US*
CONST_IDENT ::= "_"* UC_LETTER UC_ALPHANUM_US*
TYPE_IDENT ::= "_"* UC_LETTER "_"* LC_LETTER ALPHANUM_US*
CT_IDENT ::= "$" IDENTIFIER
CT_CONST_IDENT ::= "$" CONST_IDENT
CT_TYPE_IDENT ::= "$" TYPE_IDENT
Keywords
The following keywords are reserved and may not be used as identifiers:
alias as asm
anyerr
assert attribute break
case cast catch
const continue default
defer define do
else enum extern
errtype false fn
generic if import
in local macro
module nextcase nil
public return struct
switch true try
typeid $typeof define
var volatile void
while
bool quad double
float long ulong
int uint byte
short ushort char
isz usz float16
float128
$assert $case $default
$if $for $else
$elif $if $switch
$foreach $endswitch $endif
$endforeach
Operators and punctuation
The following character sequences represent operators and punctuation.
& @ ~ | ^ :
, / $ . ; )
> < # { } -
( ) * [ ] %
>= <= + += -= !
? ?: && -> &= |=
^= /= .. == ({ })
-% +% *% ++ -- %=
!= || :: << >> !!
... *%= +%= -%= <<= >>=
Integer literals
An integer literal is a sequence of digits representing an integer constant. An optional prefix sets a non-decimal base: 0b or 0B for binary, 0o, or 0O for octal, and 0x or 0X for hexadecimal. A single 0 is considered a decimal zero. In hexadecimal literals, letters a through f and A through F represent values 10 through 15.
For readability, an underscore character _ may appear after a base prefix or between successive digits; such underscores do not change the literal's value.
INTEGER ::= DECIMAL_LIT | BINARY_LIT | OCTAL_LIT | HEX_LIT
DECIMAL_LIT ::= '0' | [1-9] ('_'* DECIMAL_DIGITS)?
BINARY_LIT ::= '0' [bB] '_'* BINARY_DIGITS
OCTAL_LIT ::= '0' [oO] '_'* OCTAL_DIGITS
HEX_LIT ::= '0' [xX] '_'* HEX_DIGITS
BINARY_DIGIT ::= [01]
HEX_DIGIT ::= [0-9a-fA-F]
DECIMAL_DIGITS ::= DIGIT ('_'* DIGIT)*
BINARY_DIGITS ::= BINARY_DIGIT ('_'* BINARY_DIGIT)*
OCTAL_DIGITS ::= OCTAL_DIGIT ('_'* OCTAL_DIGIT)*
HEX_DIGITS ::= HEX_DIGIT ('_'* HEX_DIGIT)*
42
4_2
0_600
0o600
0O600 // second character is capital letter 'O'
0xBadFace
0xBad_Face
0x_67_7a_2f_cc_40_c6
170141183460469231731687303715884105727
170_141183_460469_231731_687303_715884_105727
0600 // Invalid, non zero decimal number may not start with 0
_42 // an identifier, not an integer literal
42_ // invalid: _ must separate successive digits
0_xBadFace // invalid: _ must separate successive digits
Floating point literals
A floating-point literal is a decimal or hexadecimal representation of a floating-point constant.
A decimal floating-point literal consists of an integer part (decimal digits), a decimal point, a fractional part (decimal digits), and an exponent part (e or E followed by an optional sign and decimal digits). One of the integer part or the fractional part may be elided; one of the decimal point or the exponent part may be elided. An exponent value exp scales the mantissa (integer and fractional part) by powers of 10.
A hexadecimal floating-point literal consists of a 0x or 0X prefix, an integer part (hexadecimal digits), a radix point, a fractional part (hexadecimal digits), and an exponent part (p or P followed by an optional sign and decimal digits). One of the integer part or the fractional part may be elided; the radix point may be elided as well, but the exponent part is required. An exponent value exp scales the mantissa (integer and fractional part) by powers of 2.
For readability, an underscore character _ may appear after a base prefix or between successive digits; such underscores do not change the literal value.
FLOAT_LIT ::= DEC_FLOAT_LIT | HEX_FLOAT_LIT
DEC_FLOAT_LIT ::= DECIMAL_DIGITS '.' DECIMAL_DIGITS? DEC_EXPONENT?
| DECIMAL_DIGITS DEC_EXPONENT
| '.' DECIMAL_DIGITS DEC_EXPONENT?
DEC_EXPONENT ::= [eE] [+-]? DECIMAL_DIGITS
HEX_FLOAT_LIT ::= '0' [xX] HEX_MANTISSA HEX_EXPONENT
HEX_MANTISSA ::= HEX_DIGITS '.' HEX_DIGITS?
| HEX_DIGITS
| '.' HEX_DIGITS
HEX_EXPONENT ::= [pP] [+-] DECIMAL_DIGITS
Character literals
A character literal may either: (a) represent a constant value of up to 8 bytes (16 bytes on platforms with Int128 support) or (b) a single unicode character. Escape sequences can be used to represent byte values 0-31 and 128-255.
The following backslash escapes are available to escape:
\0 0x00 zero value
\a 0x07 alert/bell
\b 0x08 backspace
\e 0x1B escape
\f 0x0C form feed
\n 0x0A newline
\r 0x0D carriage return
\t 0x09 horizontal tab
\v 0x0B vertical tab
\\ 0x5C backslash
\' 0x27 single quote '
\" 0x22 double quote "
\x Escapes a single byte hex value
\u Escapes a two byte unicode hex value
\U Escapes a four byte unicode hex value
CHAR_ELEMENT ::= [\x20-\x26] | [\x28-\x5B] | [\x5D-\x7F]
CHAR_LIT_BYTE ::= CHAR_ELEMENT | \x5C CHAR_ESCAPE
CHAR_ESCAPE ::= [abefnrtv\'\"\\]
| 'x' HEX_DIGIT HEX_DIGIT
UNICODE_CHAR ::= unicode_char
| 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
| 'U' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
CHARACTER_LIT ::= "'" (CHAR_LIT_BYTE+) | UNICODE_CHAR "'"
String literals
A string literal represents a string constant obtained from concatenating a sequence of characters. String literals are character sequences between double quotes, as in "bar". Within the quotes, any character may appear except newline and unescaped double quote. The text between the quotes forms the value of the literal, with backslash escapes interpreted as they are in rune literals, with the same restrictions. The two-digit hexadecimal (\xnn) escapes represent individual bytes of the resulting string; all other escapes represent the (possibly multi-byte) UTF-8 encoding of individual characters. Thus inside a string literal \xFF represent a single byte of value 0xFF=255, while ΓΏ, \u00FF, \U000000FF and \xc3\xbf represent the two bytes 0xc3 0xbf of the UTF-8 encoding of character U+00FF.
STRING_LIT ::= \x22 (CHAR_LIT_BYTE | UNICODE_CHAR)* \x22
Types
Types consist of built-in types and user-defined types (enums, structs, unions, bitstructs, errtype and distinct).
Boolean types
The
Integer types
Floating point types
Vector types
Complex types
A complex type is defined as a struct with two elements of the same floating point type. The first member holds the real part of a complex number, and the second member holds the imaginary part.
struct Complex
{
float real;
float imaginary;
}
String types
Array types
An array has the alignment of its element. Zero sized arrays are allowed and have the size 0.
Vararray types
Subarray types
The subarray consist of a pointer, followed by a usz length, having the alignment of pointers.
Pointer types
Struct types
A struct without any members or with only arrays of size 0 have the size 0.
The alignment of a struct without any members is 1.
Union types
The alignment of a union without any members is 1.
Error types
Th
Enum types
Function types
Virtual types
Declarations and scope
Expressions
Casts
Pointer casts
Integer to pointer cast
Any integer of pointer size or larger may be explicitly cast to a pointer. An integer to pointer cast is considered non-constant, except in the special case where the integer == 0. In that case, the result is constant null
.
Example:
byte a = 1;
int* b = (int*)a; // Invalid, pointer type is > 8 bits.
int* c = (int*)1; // Valid, but runtime value.
int* d = (int*)0; // Valid and constant value.
Pointer to integer cast
A pointer may be cast to any integer, truncating the pointer value if the size of the pointer is larger than the pointer size. A pointer to integer cast is considered non-constant, except in the special case of a null pointer, where it is equal to the integer value 0.
Example:
fn void test() { ... }
typedef VoidFunc = fn void test();
VoidFunc a = &test;
int b = (int)null;
int c = (int)a; // Invalid, not constant
int d = (int)((int*)1); // Invalid, not constant
Subscript operator
The subscript operator may take as its left side a pointer, array, subarray or vararray. The index may be of any integer type. TODO
NOTE The subscript operator is not symmetrical as in C. For example in C3 array[n] = 33
is allowed, but not n[array] = 33
. This is a change from C.
Operands
Compound Literals
Compound literals have the format
compound_literal ::= TYPE_IDENTIFIER '(' initializer_list ')'
initializer_list ::= '{' (initializer_param (',' initializer_param)* ','?)? '}'
initializer_param ::= expression | designator '=' expression
designator ::= array_designator | range_designator | field_designator
array_designator ::= '[' expression ']'
range_designator ::= '[' range_expression ']'
field_designator ::= IDENTIFIER
range_expression ::= (range_index)? '..' (range_index)?
range_index ::= expression | '^' expression
Taking the address of a compound literal will yield a pointer to stack allocated temporary.
Function calls
Varargs
For varargs, a bool
or any integer smaller than what the C ABI specifies for the c int
type is cast to int
. Any float smaller than a double is cast to double
. Compile time floats will be cast to double. Compile time integers will be cast to c int
type.