A source program is a sequence of UTF-8 encoded characters. The lexer groups adjacent characters into distinct tokens, separated by white-space and comments. For the most part, the compiler's grammar-driven parser uses only the tokens and ignores all human-friendly white-space. The notable exception is line indentation, which is significant when processing multi-line strings and when determining the end of some blocks or multi-line statements.
Tokens
The type of every token is determined by its first characters:
- Integer or Float Literal begins with a numeric digit from '0' to '9'.
- Character Literal begins with a single quote (').
- String Literal begins with a double quote ("), r", or r`.
- Lifetime Annotation begins with a single quote (').
- Identifier begins with a unicode-defined letter, an at sign '@', the hash '#', an underscore '_', or a backtick (`).
- Keyword looks like an identifier, but is reserved exclusively for the compiler.
- Operator begins with certain punctuation characters between U+0021 and U+007F.
- End-of-file, which is the null character (U+0000), end-of-file character (U+001A) or the end of the program.
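The first-character rules above can be sketched as a small dispatch function. This is only an illustration in Python, not the Cone compiler's code: the function name `token_kind` and the returned labels are hypothetical, `str.isalpha` stands in for "unicode-defined letter", and white-space and comments are assumed to have been skipped already.

```python
def token_kind(src: str, pos: int = 0) -> str:
    """Guess the kind of token that starts at src[pos] from its first characters."""
    # End of program, null character, or end-of-file character.
    if pos >= len(src) or src[pos] in "\x00\x1a":
        return "eof"
    ch = src[pos]
    if "0" <= ch <= "9":
        return "number"                  # integer or float, decided later
    if ch == '"' or src.startswith(('r"', "r`"), pos):
        return "string"
    if ch == "'":
        return "char-or-lifetime"        # decided by looking for a closing quote
    if ch.isalpha() or ch in "@#_`":
        return "identifier-or-keyword"   # keywords are filtered out afterwards
    if "\u0021" <= ch <= "\u007f":
        return "operator"
    return "unknown"
```

Note that the string check must come before the identifier check, otherwise the 'r' of a raw string would be read as the start of an identifier.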
Numeric Literal
A numeric digit (from '0' to '9') starts an integer or float literal. Although a negative sign ('-') preceding a numeric literal is not considered part of the token (as that dash might be a minus sign), it still has the desired effect of negating the number that follows. Using a negative sign on an unsigned integer literal does not make it a signed one.
Underscores may be used within a numeric literal to improve readability.
Integer Literal
An Integer literal may be:
- Decimal integer. A simple sequence of numeric digits which stops at the first non-digit character.
- Hexadecimal integer. Starts with '0x' followed by either numeric digits or 'a'-'f' (or 'A'-'F') to represent the hexadecimal digit values ten through fifteen.
By default, an integer literal is a signed 32-bit number. To change this, specify one of the following suffixes:
- i8 signed, 8-bit.
- i16 signed, 16-bit.
- i64 signed, 64-bit.
- u8 unsigned, 8-bit.
- u16 unsigned, 16-bit.
- u32 or u unsigned, 32-bit.
- u64 unsigned, 64-bit.
- usize unsigned, 32- or 64-bit.
- isize signed, 32- or 64-bit.
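As a rough sketch of how these rules combine, the following Python function decodes an integer literal into a value, a signedness flag and a bit width. The names `INT_SUFFIXES` and `parse_int_literal` are hypothetical, the suffix table contains only the suffixes listed above, and a width of `None` is an assumed way to represent the pointer-sized types.

```python
import re

# Suffix -> (signed, bits). None means pointer-sized (32 or 64 bits).
INT_SUFFIXES = {
    "i8": (True, 8), "i16": (True, 16), "i64": (True, 64),
    "u8": (False, 8), "u16": (False, 16), "u32": (False, 32), "u": (False, 32),
    "u64": (False, 64), "usize": (False, None), "isize": (True, None),
}

# Hexadecimal or decimal digits (underscores allowed), then an optional suffix.
_INT_RE = re.compile(r"(0x[0-9a-fA-F_]+|[0-9][0-9_]*)([iu][a-z0-9]*)?$")

def parse_int_literal(text: str):
    """Return (value, signed, bits) for an integer literal, or raise ValueError."""
    m = _INT_RE.match(text)
    if m is None:
        raise ValueError(f"not an integer literal: {text!r}")
    digits, suffix = m.group(1), m.group(2)
    if suffix is None:
        signed, bits = True, 32          # default: signed 32-bit
    elif suffix in INT_SUFFIXES:
        signed, bits = INT_SUFFIXES[suffix]
    else:
        raise ValueError(f"unknown integer suffix: {suffix!r}")
    digits = digits.replace("_", "")     # underscores are only for readability
    value = int(digits, 16) if digits.startswith("0x") else int(digits, 10)
    return value, signed, bits
```

The sketch does not check that the value fits in the selected width; a real lexer would presumably report that as an error.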
Float Literal
A float literal also starts with a digit from '0' to '9'. To distinguish it from an integer literal, it must contain a decimal point, an exponent ('E' or 'e'), or a type suffix that explicitly declares it to be a float ('f', 'd', 'f32' or 'f64'). A period is treated as a decimal point only when it is unambiguously not part of a range operator, that is, when it is not immediately followed by another period.
The Float token may specify an exponent, which is indicated by an 'e' or 'E' followed by an optional minus sign and additional numeric digits.
By default, a float literal is a 32-bit number. To change this, specify one of the following suffixes:
- f or f32 32-bit.
- d or f64 64-bit.
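The distinction between integer and float literals, including the range-operator exception, can be illustrated with a regular expression. This Python sketch is an assumption about how one might implement the rules, not the compiler's actual method; `scan_number` is a hypothetical name, and hexadecimal literals and integer suffixes are deliberately left out.

```python
import re

_NUM_RE = re.compile(
    r"[0-9][0-9_]*"                    # leading digits, underscores allowed
    r"(?P<frac>\.(?!\.)[0-9_]*)?"      # decimal point, unless it begins '..'
    r"(?P<exp>[eE]-?[0-9]+)?"          # exponent with optional minus sign
    r"(?P<suffix>f32|f64|f|d)?"        # explicit float suffix
)

def scan_number(src: str, pos: int = 0):
    """Return (kind, end) for the numeric literal at src[pos], or None."""
    m = _NUM_RE.match(src, pos)
    if m is None:
        return None
    # Any one of the three optional parts makes the literal a float.
    is_float = any(m.group(name) for name in ("frac", "exp", "suffix"))
    return ("float" if is_float else "integer"), m.end()
```

For the text `1..10` the scan stops after the `1` and reports an integer, leaving `..` to be read as the range operator.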
Character Literal
A character literal begins and ends with a single quote ('), within which must be found a single UTF-8 encoded Unicode character. Its type is u8 if its value fits in a byte. Otherwise, its type is u32 (an unsigned, 32-bit integer). The type is also u32 if a 'u' is appended to the end of the literal (e.g., 'a'u).
Any character whose Unicode value is 0x0020 or higher can be specified explicitly. Alternatively, one of the following escape sequences that begin with '\' may be used:
- \a: Alarm (U+0007)
- \b: Backspace (U+0008)
- \f: Form feed (U+000C)
- \n: New-line (U+000A)
- \r: Return (U+000D)
- \t: Tab (U+0009)
- \v: Vertical tab (U+000B)
- \\: \
- \': '
- \": "
- \0: Null character (U+0000)
- \xnn: the byte with the specified two-digit hexadecimal code value.
- \unnnn: the Unicode character which matches the specified four-digit hexadecimal code point.
- \Unnnnnnnn: the Unicode character which matches the specified eight-digit hexadecimal code point.
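The escape sequences listed above can be decoded with a simple table-driven loop. The following Python sketch is illustrative only: `decode_escapes` is a hypothetical helper, and it maps `\xnn` to the character with that code value, which is an assumption about how a byte value would be represented in a Python string.

```python
import string

SIMPLE_ESCAPES = {
    "a": "\a", "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
    "v": "\v", "\\": "\\", "'": "'", '"': '"', "0": "\0",
}
# Escape letter -> number of hexadecimal digits that must follow.
HEX_ESCAPES = {"x": 2, "u": 4, "U": 8}

def decode_escapes(text: str) -> str:
    """Replace every backslash escape in text with the character it denotes."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling backslash")
        code = text[i + 1]
        if code in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[code])
            i += 2
        elif code in HEX_ESCAPES:
            n = HEX_ESCAPES[code]
            digits = text[i + 2 : i + 2 + n]
            if len(digits) != n or any(c not in string.hexdigits for c in digits):
                raise ValueError(f"\\{code} needs {n} hexadecimal digits")
            out.append(chr(int(digits, 16)))
            i += 2 + n
        else:
            raise ValueError(f"unknown escape sequence \\{code}")
    return "".join(out)
```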
String Literal
String literals offer multiple techniques for specifying text content. All allow specification of multiple Unicode characters, but they vary in how they handle a few special characters, as needed for different circumstances.
A null character (U+0000) is always appended to the end of a string literal, for C compatibility. All string literals are treated as immutable.
Escaped vs. Raw Text
Many times it is convenient to use escape sequences in order to visibly include control and unicode characters. Other times, such as with regular expressions or XML text, it is less error-prone and more readable to be able to specify backslashes or double quotes without having to escape them with backslashes.
The first characters of the string literal establish how escape sequences are handled:
- Just a double quote (") begins a string literal that uses escape sequences. If a double quote or backslash is needed as part of the content, it must be escaped with a backslash.
- 'r' followed by a double quote (r") begins a raw string literal which does not use escape sequences. Backslashes need not be escaped. Double quotes cannot appear in the string literal.
- 'r' followed by a backtick (r`) begins a raw string literal which does not use escape sequences. Backslashes and double-quotes need not be escaped. Backticks cannot appear in the string literal.
- Three double-quotes (""") begins a string literal that uses escape sequences. Backslashes must be escaped, but double quotes need not be.
- 'r' followed by three double quotes (r""") begins a raw string literal which does not use escape sequences. Neither backslashes nor double quotes need to be escaped to appear in the text content.
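To make the five openers concrete, here is a hedged Python sketch of how a lexer might recognize each one and then locate the end of the literal. It also applies the terminator rule for triple-quoted literals, where the last three quotes of a longer run close the literal. The function `scan_string` and its return shape are hypothetical, and the body is returned without decoding escape sequences.

```python
def scan_string(src: str, pos: int = 0):
    """Return (is_raw, body, end) for the string literal starting at src[pos]."""
    raw = src.startswith("r", pos)
    i = pos + 1 if raw else pos
    if src.startswith('"""', i):
        quote, width = '"', 3
    elif src.startswith('"', i):
        quote, width = '"', 1
    elif raw and src.startswith("`", i):
        quote, width = "`", 1
    else:
        raise ValueError("not a string literal")
    i += width
    start = i
    while i < len(src):
        if not raw and src[i] == "\\":
            i += 2                        # an escaped character never terminates
            continue
        if src[i] == quote:
            run = i + 1
            if width == 3:                # measure the whole run of quotes
                while run < len(src) and src[run] == quote:
                    run += 1
            if run - i >= width:
                # The terminator is the last `width` quotes of the run.
                return raw, src[start : run - width], run
            i = run                       # too few quotes: part of the content
            continue
        i += 1
    raise ValueError("unterminated string literal")
```

Checking for three double quotes before checking for one is what keeps `"""` from being read as an empty string followed by a stray quote.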
If a string literal begins with a single double-quote (or backtick), it ends with the next double quote (or backtick). If the string literal begins with a triple double-quote, it ends with a triple double-quote. If there are more than three double-quotes at the end, the terminator is the last three:
""""Happy Birthday!"""" // yields the string literal: "Happy Birthday!"
Multi-line String Literals
Some string literals are long enough to require multiple lines of code to specify. It would be convenient to be able to format such content properly using indentation and readable margins, without that formatting necessarily carrying over into the text of the literal.
These are the rules that make that possible:
- Multi-line string literals must end the line right after the last of the opening double-quotes. The end-of-line character(s) that follow are not included in the string literal.
- The double quotes that terminate the literal must appear at the start of a subsequent line, although they may be indented by any number of spaces or tabs.
- The lines in-between contain the string literal's content. Space or tab indentation equivalent to that found on the terminating double-quotes line are stripped. A new-line character is appended to the line's content, unless the end-of-line is preceded by a backslash.
For example:
// Equivalent to "a\nb"
"
  a
  b\
  "
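The three rules can be sketched as a short Python function. This is an illustrative reading of the rules rather than the compiler's implementation: `multiline_body` is a hypothetical name, the caller is assumed to pass the lines that follow the opening quote line (ending with the line that holds the closing quotes), and escape sequences other than the end-of-line backslash are not processed.

```python
def multiline_body(lines):
    """Build the content of a multi-line string literal.

    `lines` holds the source lines after the opening-quote line, up to and
    including the line that contains the terminating quote(s).
    """
    closing = lines[-1]
    # Indentation of the terminating line: its leading spaces and tabs.
    indent = closing[: len(closing) - len(closing.lstrip(" \t"))]
    parts = []
    for line in lines[:-1]:
        if line.startswith(indent):
            line = line[len(indent):]     # strip the equivalent indentation
        if line.endswith("\\"):
            parts.append(line[:-1])       # backslash suppresses the new-line
        else:
            parts.append(line + "\n")
    return "".join(parts)
```

Called with the lines `  a`, `  b\` and a closing-quote line indented by two spaces, it returns the text "a\nb".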
Lifetime Annotation
A lifetime annotation looks similar to a character literal, in that it begins with a single quote ('). It is followed by a letter and then any number of alphanumeric characters. The absence of a closing single-quote distinguishes it from a character literal.
'a
'static
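Because both token types open with a single quote, the lexer has to look ahead to tell them apart. A minimal Python sketch of that decision follows, assuming a character literal is tried first and a lifetime annotation is the fallback. The name `scan_quote` and both regular expressions are hypothetical.

```python
import re

# A quote, one character or one escape sequence, a closing quote, optional 'u'.
_CHAR_RE = re.compile(
    r"'(?:\\(?:x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8}|.)|[^'\\])'u?"
)
# A quote, a letter, then any number of alphanumeric characters.
_LIFETIME_RE = re.compile(r"'[^\W\d_][^\W_]*")

def scan_quote(src: str, pos: int = 0):
    """Return (kind, end) for the token starting with a single quote at src[pos]."""
    m = _CHAR_RE.match(src, pos)
    if m:
        return "char", m.end()
    m = _LIFETIME_RE.match(src, pos)
    if m:
        return "lifetime", m.end()
    raise ValueError("neither a character literal nor a lifetime annotation")
```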
Identifier
Other than the reserved keywords, a program may define and use any identifier as a variable, member, function, method, type, etc.
Typically, an identifier begins with a letter, '@' (attributes), '#' (metaprogramming) or '_'. A letter may be 'a'-'z', 'A'-'Z', or any unicode-defined universal letter as defined by C99 in ISO/IEC 9899:1999(E) Appendix D. Identifiers are case-sensitive; 'abc' is different from 'ABC'.
Subsequent characters may be letters, digits, '$', or '_'.
To include other characters, such as punctuation, as part of an identifier, enclose the entire identifier in back-ticks.
The following are all valid identifiers:
balance toReturn True _temp_ $ π `*`
Note: '_' by itself (not followed by a letter or a number) is not an identifier, but a special punctuation token.
Keywords
These keywords are reserved and may not be used as identifiers:
- and: logical 'and' operator used in boolean expressions
- async: asynchronous execution
- baseurl: URL of the program's source code
- break: terminate a loop block
- context: the value of the currently executing execution state
- continue: re-iterate a 'while' or 'for' block
- each: the block for iterating over a collection of values
- else: a clause within an 'if' statement
- elif: a clause within an 'if' statement
- false: the value of 'false'
- if: a conditional statement
- in: a clause within an 'each' block
- into: a clause within a 'match/with' block
- local: ensures variable(s) are treated as locally scoped
- match: matches a calculated value to several possible values
- new: creates a new instance
- not: logical 'not' operator used in boolean expressions
- or: logical 'or' operator used in boolean expressions
- return: terminate execution of a method with a return value
- self: references the method's self parameter value
- selfmethod: the currently executing method or closure
- this: the value of the most inclusive 'with' block
- true: the value of 'true'
- using: clause on this block
- wait: a block that waits until all its execution contexts are done
- while: a repetitive block
- with: a clause in a 'match' block
- yield: suspend a generator with a return value
Operator and Precedence
An operator is a sequence of one or more punctuation characters with no intervening white space. The compiler is greedy, and will look for the longest character sequence that matches one of these operators.
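Greedy matching means that, at any position, the longest operator that fits wins. A minimal Python sketch of this "maximal munch" rule is shown here. The `OPERATORS` list is only a subset drawn from the tables in this section, and `match_operator` is a hypothetical name.

```python
# A subset of the operators listed in this section.
OPERATORS = [
    "**", "*", "/", "%", "+", "-", "..", ".", "::",
    "===", "==", "!=", "<=>", "<=", "<<", "<", ">=", ">>", ">",
    "&&", "||", "!", "=", "+=", "-=", "*=", "/=", ",", ";",
]

# Try longer operators first so that '<=>' is preferred over '<=' and '<'.
_BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)

def match_operator(src: str, pos: int = 0):
    """Return the longest operator that starts at src[pos], or None."""
    for op in _BY_LENGTH:
        if src.startswith(op, pos):
            return op
    return None
```

A consequence of this rule is that two adjacent operators sometimes need white space between them to be read separately.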
The operators are listed from highest to lowest evaluation priority. Operators grouped together have the same priority. The parentheses after each operator show whether it appears in front of a value (prefix, 'p') or between values (infix, 'i'); some operators can be used in both contexts. The parentheses also give the symbolic name of any method associated with that operator.
Value operators
- ( )
- (p) prioritizes expression to be evaluated as a group
- [ ]
- (p) array
- { }
- (p) code block
Term operators
- .
- (i) method call/property access
- ::
- (i '[]') indexed access
- ( )
- (i) method call parameters
- [ ]
- (i '[]') indexed access
Prefix operators
- -
- (p '@-') negate
- @
- (p) Internet resource load
- <<
- (p '<<') append to 'this'
- >>
- (p '>>') prepend to 'this'
Arithmetic/Collection operators
- **
- (i '**') exponent
- *
- (i '*') multiply
- /
- (i '/') divide/split
- %
- (i '%') remainder
- +
- (i '+') add
- -
- (i '-') subtract
Range operator
- ..
- (i) creates a range
Evaluation operators
- ==
- (i '<=>') equal
- !=
- (i '<=>') not equal
- ===
- (i) equivalent
- is
- (i 'is') match
- <=>
- (i '<=>') compare (rocketship)
- <
- (i '<=>') less than
- <=
- (i '<=>') less than or equal
- >
- (i '<=>') greater than
- >=
- (i '<=>') greater than or equal
Logical operators
- ! not
- (p) logical not
- && and
- (i) logical and
- || or
- (i) logical or
Append/Prepend operators
- <<
- (i '<<') append
- >>
- (i '>>') prepend
Assignment operators
Note: Within variable declarations, the evaluation priorities of ',' and '=' are reversed.
- ,
- (i) value separation
- =
- (i) assign
- +=
- (i '+') add in place
- -=
- (i '-') subtract in place
- *=
- (i '*') multiply in place
- /=
- (i '/') divide in place
Statement terminator
The semicolon ; is the lowest priority "operator".
End-of-File
The program code ends when it reaches a null character (U+0000), end-of-file character (U+001A) or has no more characters. The lexer will tidy up whatever is unfinished (e.g., still open blocks, literals, or comments) and then will generate the end-of-file token for the parser.
Whitespace
Whitespace consists of the source characters which are largely ignored by the parser. Whitespace comes in two forms:
- Whitespace characters. These are all the ASCII control codes (1 - 32), including space, tab, new-line and carriage return. These are primarily useful for fashioning a human-digestible layout to the program and separating two tokens. However, spaces and tabs found at the beginning of a line can sometimes be significant in determining where a statement or block ends.
Note: If the program source begins with the UTF-8 byte-order mark (U+FEFF), it is ignored.
- Comments. These are used to provide helpful documentation that improves a human reader's understanding of the source program. Cone supports both line and block comments.
Significant Indentation
The newline character separates one line from the next (carriage returns are ignored). A line's indentation is the count of all spaces or tabs found at its start. To reduce errors, a source program should use a single indentation character consistently: either only spaces or only tabs. A warning is produced when a source program uses both tabs and spaces as indentation characters.
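Measuring indentation and detecting mixed indentation characters might look like the following Python sketch. Both function names are hypothetical, and whether the compiler checks for mixing per file or per line is not stated here, so the whole-source check is an assumption.

```python
def indentation(line: str) -> int:
    """Count the spaces and tabs at the start of a line."""
    return len(line) - len(line.lstrip(" \t"))

def mixes_indent_chars(source: str) -> bool:
    """Report whether the source indents with both tabs and spaces."""
    seen = set()
    for line in source.split("\n"):
        line = line.replace("\r", "")     # carriage returns are ignored
        seen.update(line[: indentation(line)])
    return len(seen) > 1
```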
Most of the time, a line's indentation is irrelevant to the grammar and is ignored. However, there are times the parser will pay attention to indentation to determine when a statement or block ends.
End-of-Statement Determination
Some statements (e.g., if) finish with a block. The rest are terminated by a semi-colon. If multiple statements are found on the same line, they must be separated by semi-colons.
isEmpty = true; count = repl * elements
However, a semi-colon may be omitted when it would normally be placed at the end of a line. In most cases, the parser uses grammar rules to correctly determine where one statement ends and another begins, consuming as many lines as needed to finish each statement. Sometimes, however, it is grammatically ambiguous whether a statement should consume the next line, since that line could validly be either a separate statement or a continuation of the current one. This occurs when the follow-on line begins with an operator (such as '*', '-', '.', '?.', '<-', '(', or '[') that may serve as either a prefix or an infix operator.
mut a = b
*c
Should this be understood to be mut a = b * c or mut a = b; *c? Both are grammatically valid.
To make this determination, the compiler looks at whether the follow-on line(s) are indented from the first line of the statement. If so, it interprets it as a continuation of the statement. If not, it interprets it as a new statement.
These rules mean the above example is considered to be two statements. To make it a single statement, use indentation (which also helps people):
mut a = b
    *c
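The indentation test for ambiguous follow-on lines can be sketched in Python as follows. This is a deliberate simplification: the hypothetical `split_statements` applies the indentation rule to every line, whereas the parser described here only falls back on indentation when the grammar alone cannot decide.

```python
def indent_of(line: str) -> int:
    """Count the spaces and tabs at the start of a line."""
    return len(line) - len(line.lstrip(" \t"))

def split_statements(lines):
    """Group lines into statements using only the indentation rule."""
    statements = []
    for line in lines:
        # Deeper than the statement's first line: a continuation of it.
        if statements and indent_of(line) > indent_of(statements[-1][0]):
            statements[-1].append(line)
        else:
            statements.append([line])
    return statements
```

With the unindented `*c` the result is two statements; with `*c` indented the two lines are grouped into one.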
Blocks and Significant Indentation
Cone, like many languages, supports the use of curly braces to delimit a block consisting of multiple statements:
if a == 0 {
    break
}
Some people really prefer how nicely braces visually delimit the beginning and end of blocks. However, for very small blocks like this one, this approach wastes precious vertical space in an editor. The '}' closing brace takes up a whole line by itself, and some style guides also put the '{' opening brace on its own line. These wasted lines mean less of the program's logic is viewable without scrolling.
For those eager to compact lines together more readably, Cone provides an alternative syntax for delimiting blocks. Instead of beginning the block with '{', use ':' instead. This can either be followed by the block statement(s) on the same line, or one or more statements indented on the following lines.
Here is the above example where the block statement follows on the same line.
if a == 0: break
One can also specify multiple statements for the block on the same line, separated by semicolons:
if onSameLine: doSomething(); doSomethingElse()
The other approach is to put the block's statement(s) on follow-on lines, indented from the previous line. The block ends when the indentation goes back to its original amount.
if a == 0:
    outlier = current
    break
b = 5
This example is equivalent to this curly-brace version:
if a == 0 {
    outlier = current
    break
}
b = 5
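Finding where an indentation-delimited block ends amounts to scanning forward until the indentation returns to that of the line that opened the block. The Python sketch below illustrates this under simplifying assumptions: `block_extent` is a hypothetical helper, and blank lines and nested blocks receive no special treatment.

```python
def indent_of(line: str) -> int:
    """Count the spaces and tabs at the start of a line."""
    return len(line) - len(line.lstrip(" \t"))

def block_extent(lines, header_index: int) -> int:
    """Return the index one past the last line of the indented block
    opened by the ':' line at lines[header_index]."""
    base = indent_of(lines[header_index])
    end = header_index + 1
    # The block continues while lines stay indented deeper than its header.
    while end < len(lines) and indent_of(lines[end]) > base:
        end += 1
    return end
```

For the four-line example above, the block opened on the first line covers the next two lines, and `b = 5` falls outside it.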
Comments
Comments make the code's logic easier to understand and maintain. They are for people. They have no impact on program execution.
Comments may be placed between any two tokens, but never within a text literal.
There are two types of comments:
- A line comment begins with '//' and continues to the end of the line.
- A block comment starts with '/*'. It ends with its corresponding '*/', extending across multiple lines, if desired. Block comments may be nested inside each other. Line comments may be nested inside a block comment. '/*' or '*/' embedded in text literals or line comments are ignored.
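Nested block comments cannot be matched by searching for the first '*/'; the lexer has to track nesting depth. Here is a hedged Python sketch of that idea. The name `skip_comment` is hypothetical, a line comment nested in a block comment is assumed to hide any '/*' or '*/' up to the end of its line, and text literals inside comments are not handled.

```python
def skip_comment(src: str, pos: int = 0) -> int:
    """Return the index just past the comment at src[pos], or pos if none."""
    if src.startswith("//", pos):
        end = src.find("\n", pos)
        return len(src) if end == -1 else end
    if not src.startswith("/*", pos):
        return pos
    depth = 0
    i = pos
    while i < len(src):
        if src.startswith("/*", i):
            depth += 1                    # entering a (possibly nested) block
            i += 2
        elif src.startswith("*/", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i                  # the outermost block just closed
        elif src.startswith("//", i):
            # A nested line comment hides any '/*' or '*/' on the rest of its line.
            end = src.find("\n", i)
            i = len(src) if end == -1 else end
        else:
            i += 1
    return len(src)                       # unterminated: runs to end of source
```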