Parsing S-Expressions with Kotlin: Char to Token

This is part of a series on parsing with Kotlin, which starts here.
In our quest to parse s-expressions such as these:
font
"some string"
(font 12.5 12.5)
… we have elucidated the kinds of tokens we want to break these expressions into, and we have also surmised that we need a one-character lookahead in order to do things properly.
Let’s take a look now at how we can read characters from our scanner and turn them into tokens. We are going to flesh out this method:
fun Scanner<Char>.parse(): Token {
    TODO()
}
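Before going further, it may help to pin down the shape of Scanner<T> that parse() leans on. This is my stand-in sketch, with the names inferred from how peek() and get() are used below; the series' real declarations may differ:

```kotlin
// Assumed shape of the series' Scanner<T>: one-item lookahead (peek)
// plus consumption (get). Both return null when the input is exhausted.
interface Scanner<T> {
    fun peek(): T?   // next item without consuming it, or null at end of input
    fun get(): T?    // next item, consuming it, or null at end of input
}

// A minimal string-backed Scanner<Char>, just so the snippets can run.
class StringScanner(private val s: String) : Scanner<Char> {
    private var i = 0
    override fun peek(): Char? = s.getOrNull(i)
    override fun get(): Char? = s.getOrNull(i)?.also { i++ }
}
```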
This Kotlin syntax defines a function with a receiver, inside of which the receiver can be accessed using the keyword this.
[NB: In Kotlin, this can refer to different things inside the body of a function when lambdas are involved (because lambdas can have receivers too!); in that case, the qualified syntax, e.g. this@parse, disambiguates this.]
To make things more interesting, this can be elided altogether. I think this feature
combines immense power and pithiness, but some people disagree. Those same people tend to
gravitate to bash and Python and emit spores. Stay clear and wear a mask.
So where do we start? The signature for Scanner.peek() returns null when no more data is available. Normies would add an EOF token to signify the end of input… but Kotlin has already given us the answer: return null. Thus:
fun Scanner<Char>.parse(): Token? =
    when (peek()) {
        null -> null
        else -> TODO()
    }
Let’s go on, and take a look at how we handle '(', and we’ll do the same off camera
for its mate, and for a space:
fun Scanner<Char>.parse(): Token? =
    when (peek()) {
        null -> null
        '(' -> {
            get()
            Token.LPAREN
        }
        else -> TODO()
    }
If it feels dumb to treat spaces as tokens, that's because it is,
but I have no interest in working out the details of making parse() use a while loop
just so we can skip spaces, for now. We can clean up the mess later. Also, mark my words,
we are not done doing dumb things; I mean, the astute reader will have realized that we are
reading our input one character at a time, like a barbarian entering the temple of Pallas Athena
wearing a loincloth. Buffers, man!
The rest is shoveling coal: when we see a '"', we know it is a string, which we read until
we find the next quote. Symbols begin with [a-zA-Z] and run until we find a character that
is not symbol-y. Numbers are kinda the same. Here's the result:
fun Scanner<Char>.parse(): Token? =
    when (peek()) {
        null -> null
        '(' -> {
            get()
            Token.LPAREN
        }
        ')' -> {
            get()
            Token.RPAREN
        }
        '"' -> Token.STRING(consumeString())
        ' ' -> {
            get()
            Token.SPACE
        }
        '-', in '0'..'9' -> Token.NUM.new(consumeNumber())
        in 'a'..'z', in 'A'..'Z' -> Token.SYMBOL(consumeSymbol())
        else -> error("Unexpected character: ${peek()}")
    }
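If you want to watch the dispatcher run end to end, here is a self-contained version with stand-in declarations for Scanner, Token, and the consume helpers (the declarations and the symbol-y/number-y rules are my guesses at the series' real code), driven in a loop until parse() returns null:

```kotlin
interface Scanner<T> { fun peek(): T?; fun get(): T? }

class StringScanner(private val s: String) : Scanner<Char> {
    private var i = 0
    override fun peek(): Char? = s.getOrNull(i)
    override fun get(): Char? = s.getOrNull(i)?.also { i++ }
}

sealed class Token {
    object LPAREN : Token()
    object RPAREN : Token()
    object SPACE : Token()
    data class STRING(val value: String) : Token()
    data class SYMBOL(val name: String) : Token()
    data class NUM(val value: Double) : Token() {
        companion object {
            // mirrors the Token.NUM.new(...) factory used in the article
            fun new(value: Double) = NUM(value)
        }
    }
}

fun Scanner<Char>.consumeString(): String {
    var result = ""
    get() // consume starting "
    while (peek() != null && peek() != '"') result += get()!!
    get() // consume ending "
    return result
}

fun Scanner<Char>.consumeSymbol(): String {
    var result = ""
    while (peek()?.isLetterOrDigit() == true) result += get()!!
    return result
}

fun Scanner<Char>.consumeNumber(): Double {
    var result = ""
    if (peek() == '-') result += get()!!
    while (peek() != null && (peek()!!.isDigit() || peek() == '.')) result += get()!!
    return result.toDouble() // a lone "-" would throw; good enough for a sketch
}

fun Scanner<Char>.parse(): Token? = when (peek()) {
    null -> null
    '(' -> { get(); Token.LPAREN }
    ')' -> { get(); Token.RPAREN }
    '"' -> Token.STRING(consumeString())
    ' ' -> { get(); Token.SPACE }
    '-', in '0'..'9' -> Token.NUM.new(consumeNumber())
    in 'a'..'z', in 'A'..'Z' -> Token.SYMBOL(consumeSymbol())
    else -> error("Unexpected character: ${peek()}")
}

fun main() {
    val scanner = StringScanner("(font 12.5)")
    val tokens = generateSequence { scanner.parse() }.toList()
    println(tokens)
}
```

generateSequence keeps calling the lambda until it returns null, which dovetails nicely with the "null means end of input" decision above.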
(Yes, this is the first time we see Token.SPACE. We did the needful in the Token interface to make it happen)
The consume() methods are pedestrian. Here's consumeString() for reference:
fun Scanner<Char>.consumeString(): String {
var result = ""
get() // consume starting "
while (peek() != null && peek() != '"') {
result += get()!!
}
get() // consume ending "
return result
}
Symbols, for example, do not need to consume any extraneous characters beyond the ones that accumulate into the symbol's name.
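A plausible consumeSymbol() under that rule might look like this (the "symbol-y" predicate and the Scanner stand-ins are my assumptions; the series' real lexer may accept more characters, e.g. '-' or '?'):

```kotlin
interface Scanner<T> { fun peek(): T?; fun get(): T? }

class StringScanner(private val s: String) : Scanner<Char> {
    private var i = 0
    override fun peek(): Char? = s.getOrNull(i)
    override fun get(): Char? = s.getOrNull(i)?.also { i++ }
}

// Assumed predicate for "symbol-y" characters.
private fun Char.isSymbolChar(): Boolean = isLetterOrDigit()

fun Scanner<Char>.consumeSymbol(): String {
    var result = ""
    // No delimiters to discard: just accumulate while the lookahead is symbol-y.
    while (peek()?.isSymbolChar() == true) {
        result += get()!!
    }
    return result
}
```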
In the next installment, we'll put the parse() method to use to implement a
Scanner<Token> that gets its input from a Scanner<Char>.
Both Iterator<T> and Scanner<T> lend themselves well to stacking, where a scanner is built
on top of another, on top of another. You could, for instance, implement a Scanner<Token>
on top of a Scanner<Token> that simply ignores Token.SPACE. We’ll use that same idea to
parse the stuff inside parentheses (cons lists) into a new kind of Token.
Computer Science says (and if it does not, it should) that you can take any stack of Scanner<T>
and convert them into a single implementation, but it does not really buy you any improvement in algorithmic
efficiency, since Scanner<T> simply pulls the data as it is needed, across any of the layers involved.
The upside is simplicity, as each layer can be committed to a single responsibility. In our case we’ll have these:
- Convert Char to simple lexical tokens.
- Glob up cons lists into a more complex cons list structure.
- Convert tokens into data structures representing a symbol library.
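The space-ignoring layer mentioned above can be sketched generically: a hypothetical FilterScanner wraps any Scanner<T> and silently drops items matching a predicate (the Scanner interface here is my stand-in for the series' real one):

```kotlin
interface Scanner<T> { fun peek(): T?; fun get(): T? }

// A scanner stacked on top of another, silently dropping items that
// match `drop` -- e.g. drop = { it == Token.SPACE } for a token stream.
class FilterScanner<T>(
    private val source: Scanner<T>,
    private val drop: (T) -> Boolean,
) : Scanner<T> {
    private fun skipDropped() {
        while (source.peek()?.let(drop) == true) source.get()
    }
    override fun peek(): T? { skipDropped(); return source.peek() }
    override fun get(): T? { skipDropped(); return source.get() }
}

// A list-backed scanner, just to demo the stacking.
class ListScanner<T>(private val items: List<T>) : Scanner<T> {
    private var i = 0
    override fun peek(): T? = items.getOrNull(i)
    override fun get(): T? = items.getOrNull(i)?.also { i++ }
}
```

Because the filtering happens lazily in peek()/get(), stacking this on top of the tokenizer costs nothing extra: items are still pulled one at a time, on demand.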
The funny part is that at the top of this layer cake, our scanner will return a single item: a symbol library.
Let’s marry parse() and Scanner<Char> into a Scanner<Token>.