A source program is a sequence of UTF-8 encoded characters. The lexer's job is to group adjacent characters into distinct tokens, separated by white-space and comments. For the most part, the compiler's grammar-driven parser uses only the tokens and ignores all human-friendly white-space. The notable exception is line indentation, which is significant when processing multi-line strings and when determining the end of some blocks or multi-line statements.

Tokens

The type of every token is determined by its first characters:

Numeric Literal

A numeric digit (from '0' to '9') starts an integer or float literal. Although a negative sign ('-') preceding a numeric literal is not considered part of the token (as that dash might be a minus sign), it still has the desired effect of negating the number that follows. Using a negative sign on an unsigned integer literal does not make it a signed one.

Underscores may be used within a numeric literal to improve readability.

Integer Literal

An Integer literal may be:

By default, an integer literal is a signed 32-bit number. To change this, specify one of the following suffixes:

Float Literal

A float literal also starts with a digit from '0' to '9'. To distinguish it from an integer literal, it must contain a decimal point, exponent ('E' or 'e'), or a type suffix that explicitly declares it to be a float ('f', 'd', 'f32' or 'f64'). A period is considered to be a decimal point if it is unambiguously not being used as part of a range operator, which is to say that the period is not immediately followed by another period.

The Float token may specify an exponent, which is indicated by an 'e' or 'E' followed by an optional minus sign and additional numeric digits.
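The integer-versus-float decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Cone compiler's code: the function name classify_numeric is hypothetical, only decimal digits are handled, and integer type suffixes are ignored.

```python
DIGITS = "0123456789"
# Float suffixes named in the text; longer ones are tried first.
FLOAT_SUFFIXES = ("f32", "f64", "f", "d")

def classify_numeric(text, start=0):
    """Return (kind, lexeme) for the numeric literal starting at text[start]."""
    n = len(text)

    def skip_digits(j):
        # Underscores are allowed inside the literal for readability.
        while j < n and (text[j] in DIGITS or text[j] == "_"):
            j += 1
        return j

    is_float = False
    i = skip_digits(start)
    # A period is a decimal point only if it is not the start of '..'.
    if i < n and text[i] == "." and not text.startswith("..", i):
        is_float = True
        i = skip_digits(i + 1)
    # Exponent: 'e' or 'E', an optional minus sign, then digits.
    if i < n and text[i] in "eE":
        j = i + 1
        if j < n and text[j] == "-":
            j += 1
        if j < n and text[j] in DIGITS:
            is_float = True
            i = skip_digits(j)
    # An explicit float suffix also makes the literal a float.
    for suffix in FLOAT_SUFFIXES:
        if text.startswith(suffix, i):
            is_float = True
            i += len(suffix)
            break
    return ("float" if is_float else "int", text[start:i])
```

With this sketch, the text 1..5 yields the integer token 1 and leaves the range operator for the next token, while 3.14, 2e-3 and 7f are all classified as floats.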

By default, a float literal is a 32-bit number. To change this, specify one of the following suffixes:

Character Literal

A character literal begins and ends with a single quote (') within which must be found a single UTF-8 Unicode character. Its type is u8 if its value fits in a byte. Its type is u32 (an unsigned, 32-bit integer) if its value does not fit in a byte or if a 'u' is appended at the end of the literal (e.g., 'a'u).

Any character whose Unicode value is 0x0020 or higher can be specified explicitly. Alternatively, one of the following escape sequences that begin with '\' may be used:

\a
Alarm (U+0007)
\b
Backspace (U+0008)
\f
Form feed (U+000C)
\n
New-line (U+000A)
\r
Return (U+000D)
\t
Tab (U+0009)
\v
Vertical tab (U+000B)
\\
\
\'
'
\"
"
\0
Null character (U+0000)
\xnn
Hexadecimal code value for a byte.
\unnnn
Unicode character which matches the specified hexadecimal code point.
\Unnnnnnnn
Unicode character which matches the specified hexadecimal code point.
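As a rough illustration of the escape sequences listed above, the following Python sketch decodes the text found between the quotes of a character literal and picks its type. The names decode_char and char_type are hypothetical, and the sketch assumes that "fits in a byte" means a value of at most 0xFF.

```python
import string

# Single-character escapes from the table above, mapped to code points.
SIMPLE_ESCAPES = {
    "a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A, "r": 0x0D, "t": 0x09,
    "v": 0x0B, "\\": 0x5C, "'": 0x27, '"': 0x22, "0": 0x00,
}
# Hexadecimal escapes and the number of hex digits each one expects.
HEX_ESCAPES = {"x": 2, "u": 4, "U": 8}

def decode_char(body):
    """Return the code point for the text between the single quotes."""
    if not body.startswith("\\"):
        if len(body) != 1:
            raise ValueError("expected exactly one character")
        return ord(body)
    if len(body) < 2:
        raise ValueError("incomplete escape sequence")
    kind, rest = body[1], body[2:]
    if kind in SIMPLE_ESCAPES and rest == "":
        return SIMPLE_ESCAPES[kind]
    if (kind in HEX_ESCAPES and len(rest) == HEX_ESCAPES[kind]
            and all(c in string.hexdigits for c in rest)):
        return int(rest, 16)
    raise ValueError("unknown escape sequence: " + body)

def char_type(code_point, has_u_suffix=False):
    """u8 when the value fits in a byte and no 'u' suffix is present."""
    return "u32" if has_u_suffix or code_point > 0xFF else "u8"
```

For instance, the literal body \u03c0 decodes to the code point for π, which does not fit in a byte and so would be typed u32.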

String Literal

With string literals, multiple techniques are offered for specifying text content. All allow specification of multiple Unicode characters, but they vary in the handling of a few special characters as needed for different circumstances.

A null character (U+0000) is always appended to the end of a string literal, for C compatibility. All string literals are treated as immutable.

Escaped vs. Raw Text

Many times it is convenient to use escape sequences in order to visibly include control and unicode characters. Other times, such as with regular expressions or XML text, it is less error-prone and more readable to be able to specify backslashes or double quotes without having to escape them with backslashes.

The first characters of the string literal establish how escape sequences are handled:

If a string literal begins with a single double-quote (or backtick), it ends with the next double quote (or backtick). If the string literal begins with a triple double-quote, it ends with a triple double-quote. If there are more than three double-quotes at the end, the terminator is the last three:

""""Happy Birthday!""""   // yields the string literal: "Happy Birthday!"

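The delimiter rules for string literals, including the "last three quotes win" rule, might be sketched as follows in Python. The function name scan_string is hypothetical, and the sketch deliberately leaves out escape-sequence handling and multi-line formatting; it only locates where the literal starts and ends.

```python
def scan_string(src, start=0):
    """Return (content, index just past the literal) for a string at src[start]."""
    if src.startswith('"""', start):
        body = start + 3
        close = src.find('"""', body)
        if close < 0:
            raise ValueError("unterminated string literal")
        end = close + 3
        # If more quotes follow, the terminator is the last three of the run.
        while end < len(src) and src[end] == '"':
            end += 1
        return src[body:end - 3], end
    quote = src[start]
    if quote not in '"`':
        raise ValueError("not a string literal")
    close = src.find(quote, start + 1)
    if close < 0:
        raise ValueError("unterminated string literal")
    return src[start + 1:close], close + 1
```

Applied to the four-quote example above, the sketch treats the fourth opening quote and the first closing quote as content, so the resulting text keeps its surrounding double quotes.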
Multi-line String Literals

Some string literals are long enough to require multiple lines of code to specify. It would be convenient to be able to format such content properly using indentation and readable margins, without that formatting necessarily carrying over into the text of the literal.

These are the rules that make that possible:

For example:

// Equivalent to "a\nb"
"
   a
   b\
   "

Lifetime Annotation

A lifetime annotation looks similar to a character literal, in that it begins with a single quote ('). It is followed by a letter and then any number of alphanumeric characters. The absence of a closing single-quote distinguishes it from a character literal.

'a
'static
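One way a lexer could tell the two single-quote tokens apart is sketched below in Python. The function name classify_quote is hypothetical, and Python's isalpha and isalnum are used as a stand-in for Cone's own definition of letters and alphanumeric characters, which is an assumption.

```python
def classify_quote(src, start=0):
    """Classify the token beginning with the single quote at src[start]."""
    n = len(src)
    i = start + 1
    if i < n and src[i].isalpha():
        j = i + 1
        while j < n and src[j].isalnum():
            j += 1
        # No closing quote after the name: this is a lifetime annotation.
        if not (j < n and src[j] == "'"):
            return ("lifetime", src[start:j])
    # Otherwise scan to the closing quote, stepping over escaped characters.
    j = i
    while j < n and src[j] != "'":
        j += 2 if src[j] == "\\" else 1
    if j >= n:
        raise ValueError("unterminated character literal")
    return ("char", src[start:j + 1])
```

Under these assumptions, 'a and 'static are reported as lifetime annotations, while 'a' and '\n' are reported as character literals.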

Identifier

Other than the reserved keywords, a program may define and use any identifier as a variable, member, function, method, type, etc.

Typically, an identifier begins with a letter, '@' (attributes), '#' (metaprogramming) or '_'. A letter may be 'a'-'z', 'A'-'Z', or any unicode-defined universal letter as defined by C99 in ISO/IEC 9899:1999(E) Appendix D. Identifiers are case-sensitive; 'abc' is different from 'ABC'.

Subsequent characters may be letters, digits, '$', or '_'.

To be able to include other characters, such as punctuation, as part of an identifier, enclose the entire identifier in back-ticks.

The following are all valid identifiers:

balance toReturn True _temp_ $ π `*`

Note: '_' by itself (not followed by a letter or a number) is not an identifier, but a special punctuation token.

Keywords

These keywords are reserved and may not be used as identifiers:

and
logical 'and' operator used in boolean expressions
async
asynchronous execution
baseurl
url of the program's source code
break
terminate a loop block
context
The value of the currently executing execution state
continue
Re-iterate a 'while' or 'for' block
each
The block for iterating over a collection of values
else
A clause within an 'if' statement
elif
A clause within an 'if' statement
false
The value of 'false'
if
A conditional statement
in
A clause within an 'each' block
into
A clause within a 'match/with' block
local
ensures variable(s) are treated as locally scoped
match
Matches a calculated value to several possible values
new
Creates a new instance
not
logical 'not' operator used in boolean expressions
or
logical 'or' operator used in boolean expressions
return
terminate execution of a method with a return value
self
references the method's self parameter value
selfmethod
the currently executing method or closure
this
The value of the most inclusive 'with' block
true
The value of 'true'
using
clause on this block
wait
A block that waits until all its execution contexts are done.
while
A repetitive block
with
A clause in a 'match' block
yield
suspend a generator with a return value

Operator and Precedence

An operator is a sequence of one or more punctuation characters with no intervening white space. The compiler is greedy, and will look for the longest character sequence that matches one of these operators.
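Greedy, longest-match operator recognition can be illustrated with a short Python sketch. The function name match_operator is hypothetical, and the operator list is limited to the punctuation operators shown in the tables of this section (brackets and keyword operators are left out).

```python
# Punctuation operators from this section, longest first, so that
# '<=>' is tried before '<=' and '<=' before '<'.
OPERATORS = sorted(
    ["**", "*", "/", "%", "+", "-", "..", ".", "::", "==", "!=", "===",
     "<=>", "<", "<=", ">", ">=", "<<", ">>", "!", "&&", "||", ",", "=",
     "+=", "-=", "*=", "/="],
    key=len,
    reverse=True,
)

def match_operator(src, start=0):
    """Return the longest operator found at src[start], or None."""
    for op in OPERATORS:
        if src.startswith(op, start):
            return op
    return None
```

Because candidates are tried from longest to shortest, the text <=> is read as one compare operator rather than as <= followed by >.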

The operators are sequenced from highest to lowest evaluation priority. Operators grouped together have the same priority. In parentheses is shown whether an operator appears in front of a value (prefix) or between values (infix). Some operators can be used in both contexts. It also specifies the symbolic name of any method associated with that operator.

Value operators

( )
(p) prioritizes expression to be evaluated as a group
[ ]
(p) array
{ }
(p) code block

Term operators

.
(i) method call/property access
::
(i '[]') indexed access
( )
(i) method call parameters
[ ]
(i '[]') indexed access

Prefix operators

-
(p '@-') negate
@
(p) Internet resource load
<<
(p '<<') append to 'this'
>>
(p '>>') prepend to 'this'

Arithmetic/Collection operators

**
(i '**') exponent
*
(i '*') multiply
/
(i '/') divide/split
%
(i '%') remainder
+
(i '+') add
-
(i '-') subtract

Range operator

..
(i) creates a range

Evaluation operators

==
(i '<=>') equal
!=
(i '<=>') not equal
===
(i) equivalent
is
(i 'is') match
<=>
(i '<=>') compare (rocketship)
<
(i '<=>') less than
<=
(i '<=>') less than or equal
>
(i '<=>') greater than
>=
(i '<=>') greater than or equal

Logical operators

! not
(p) logical not
&& and
(i) logical and
|| or
(i) logical or

Append/Prepend operators

<<
(i '<<') append
>>
(i '>>') prepend

Assignment operators

Note: Within variable declarations, the evaluation priorities of ',' and '=' are reversed.

,
(i) value separation
=
(i) assign
+=
(i '+') add in place
-=
(i '-') subtract in place
*=
(i '*') multiply in place
/=
(i '/') divide in place

Statement terminator

The semicolon ; is the lowest priority "operator".

End-of-File

The program code ends when it reaches a null character (U+0000), end-of-file character (U+001A) or has no more characters. The lexer will tidy up whatever is unfinished (e.g., still open blocks, literals, or comments) and then will generate the end-of-file token for the parser.

Whitespace

Whitespace consists of the source characters which are largely ignored by the parser. Whitespace comes in two forms:

Significant Indentation

The newline character separates one line from the next (carriage returns are ignored). A line's indentation is the count of all spaces or tabs found at the start of that line. To reduce errors, a source program should be consistent in the indentation character it uses: either only spaces or only tabs. A warning is produced when a source program uses both tabs and spaces as indentation characters.
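Measuring indentation and detecting mixed indentation characters could look like the following Python sketch. The function names indentation and mixes_tabs_and_spaces are hypothetical; the sketch assumes each leading space or tab counts as one unit, as the text describes.

```python
def indentation(line):
    """Count the spaces and tabs found at the start of a line."""
    count = 0
    for ch in line:
        if ch not in " \t":
            break
        count += 1
    return count

def mixes_tabs_and_spaces(source):
    """True if the program indents with both tabs and spaces."""
    seen = set()
    # Carriage returns are ignored; newlines separate the lines.
    for line in source.replace("\r", "").split("\n"):
        seen.update(line[:indentation(line)])
    return len(seen) > 1
```

A lexer built this way would emit its warning whenever mixes_tabs_and_spaces returns true for the source text.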

Most of the time, a line's indentation is irrelevant to the grammar and is ignored. However, there are times the parser will pay attention to indentation to determine when a statement or block ends.

End-of-Statement Determination

Some statements (e.g., if) finish with a block. The rest are terminated by a semi-colon. If multiple statements are found on the same line, they must be separated by semi-colons.

isEmpty = true; count = repl * elements

However, a semi-colon may be omitted when it would normally be placed at the end of a line. In most cases, the parser uses grammar rules to determine where one statement ends and the next begins, consuming as many lines as needed to finish each statement. Sometimes, however, it is grammatically ambiguous whether a statement should consume the next line, because that line could validly be either a separate statement or a continuation of the current one. This occurs when the follow-on line begins with any operator (such as '*', '-', '.', '?.', '<-', '(', or '[') that may serve as either a prefix or infix operator.

mut a = b
*c

Should this be understood to be mut a = b * c or mut a = b; *c? Both are grammatically valid.

To make this determination, the compiler looks at whether the follow-on line(s) are indented from the first line of the statement. If so, it interprets it as a continuation of the statement. If not, it interprets it as a new statement.

These rules mean the above example is considered to be two statements. To make it a single statement, use indentation (which also helps people):

mut a = b
  * c
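The indentation tie-break for ambiguous follow-on lines can be sketched in Python as below. The names starts_ambiguously and is_continuation are hypothetical, and the operator list is just the set of examples given in the text, not necessarily the complete set the compiler uses.

```python
# Example line starts that the text names as prefix-or-infix ambiguous.
AMBIGUOUS_STARTS = ("?.", "<-", "*", "-", ".", "(", "[")

def indent_of(line):
    """Number of leading spaces or tabs."""
    return len(line) - len(line.lstrip(" \t"))

def starts_ambiguously(line):
    """True if the line begins with an operator that may be prefix or infix."""
    return line.lstrip(" \t").startswith(AMBIGUOUS_STARTS)

def is_continuation(first_line, next_line):
    """Apply the tie-break: an ambiguous follow-on line continues the
    statement only when it is indented deeper than the statement's first line."""
    return starts_ambiguously(next_line) and indent_of(next_line) > indent_of(first_line)
```

With the two examples above, the unindented *c is treated as a new statement, while the indented * c is treated as part of mut a = b.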

Blocks and Significant Indentation

Cone, like many languages, supports the use of curly braces to delimit a block consisting of multiple statements:

if a == 0 {
  break
}

Some people really prefer how nicely braces visually delimit the begin and end of blocks. However, for very small blocks, like this one, this approach wastes precious vertical space viewable in an editor. The '}' closing brace takes up a whole line by itself. And for some style guides, the '{' opening brace is also put on its own second line. These wasted lines mean less of the program's logic is viewable in an editor without scrolling.

For those eager to compact lines together more readably, Cone provides an alternative syntax for delimiting blocks. Instead of beginning the block with '{', use ':' instead. This can either be followed by the block statement(s) on the same line, or one or more statements indented on the following lines.

Here is the above example where the block statement follows on the same line.

if a == 0: break

One can also specify multiple statements for the block on the same line, separated by semicolons:

if onSameLine: doSomething(); doSomethingElse()

The other approach is to put the block's statement(s) on follow-on lines, indented from the previous line. The block ends when the indentation goes back to its original amount.

if a == 0:
  outlier = current
  break
b = 5

This example is equivalent to this curly-brace version:

if a == 0 {
  outlier = current
  break
}
b = 5
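Finding where an indentation-delimited block ends can be sketched in Python as follows. The function name block_extent is hypothetical, and treating blank lines as part of the block is an assumption of this sketch rather than something the text specifies.

```python
def indent_of(line):
    """Number of leading spaces or tabs."""
    return len(line) - len(line.lstrip(" \t"))

def block_extent(lines, header):
    """Return the index of the first line after the block opened by
    lines[header], which is assumed to end with ':'."""
    base = indent_of(lines[header])
    end = header + 1
    # The block continues while lines stay indented deeper than the header.
    # Assumption: blank lines do not end the block.
    while end < len(lines) and (not lines[end].strip() or indent_of(lines[end]) > base):
        end += 1
    return end
```

For the four-line example above, the block opened by the if line covers the next two lines, and b = 5 is the first statement after it.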

Comments

Comments make the code's logic easier to understand and maintain. They are for people. They have no impact on program execution.

Comments may be placed between any two tokens (but not within a text literal).

There are two types of comments: