
This is part of a series on parsing with Kotlin, which starts here

In our quest to parse s-expressions such as these:

 font
 "some string"
 (font 12.5 12.5)

… we have elucidated the kinds of tokens we want to break these expressions into, and we have also surmised that we need a one-character lookahead in order to do things properly.
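For context, the Scanner<Char> from the earlier installments can be assumed to look something like this — a minimal sketch; peek() and get() are the names used in this series, while StringScanner is just illustrative scaffolding:

```kotlin
// A minimal sketch of the Scanner interface assumed throughout this series:
// peek() inspects the next item without consuming it (our one-character
// lookahead), get() consumes and returns it; both return null at end of input.
interface Scanner<T> {
    fun peek(): T?
    fun get(): T?
}

// A trivial string-backed implementation, for illustration only.
class StringScanner(private val input: String) : Scanner<Char> {
    private var pos = 0
    override fun peek(): Char? = input.getOrNull(pos)
    override fun get(): Char? = input.getOrNull(pos)?.also { pos++ }
}
```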

Let’s take a look now at how we can read characters from our scanner and turn them into tokens. We are going to flesh out this method:

 fun Scanner<Char>.parse(): Token {
     TODO()
 }

This Kotlin syntax defines a function with a receiver; inside the body, the receiver can be accessed using the keyword this.

[NB: In Kotlin, this can refer to different things inside the body of a function when lambdas are involved (because lambdas can have receivers too!); in that case, a labeled form such as this@parse disambiguates this.]

To make things more interesting, this can be elided altogether. I think this feature combines immense power and pithiness, but some people disagree. Those same people tend to gravitate to bash and Python and emit spores. Stay clear and wear a mask.
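A tiny, made-up illustration of the labeled this — none of these names come from the series:

```kotlin
// Illustrative only: inside a lambda with a receiver, a bare `this` refers to
// the innermost receiver, while a label such as this@parse reaches the outer one.
fun String.parse(): String =
    StringBuilder().run {
        append(this@parse.uppercase()) // this@parse: the outer String receiver
        append(':')
        append(this.length)            // bare this: the inner StringBuilder
        toString()
    }
```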

So where do we start? Scanner.peek() returns null when no more data is available. Normies would add an EOF token to signify the end of input… but Kotlin has already given us the answer: return null. Thus:

 fun Scanner<Char>.parse(): Token? =
     when (peek()) {
         null -> null
         else -> TODO()
     }

Let’s go on, and take a look at how we handle '(', and we’ll do the same off camera for its mate, and for a space:

 fun Scanner<Char>.parse(): Token? =
     when (peek()) {
         null -> null
         '(' -> {
             get()
             Token.LPAREN
         }
         else -> TODO()
     }

If it feels dumb to emit spaces as tokens, that’s because it is; but I have no interest, for now, in working out the details of making parse() use a while loop just so that we can skip spaces. We can clean up the mess later. Also, mark my words, we are not done doing dumb things; I mean, the astute reader will have realized that we are reading our input one character at a time, like a barbarian entering the temple of Pallas Athena wearing a loincloth. Buffers, man!

The rest is shoveling coal: when we see a '"', we know it is a string, which we read until we find the next quote. Symbols begin with a letter in [a-zA-Z] and run until we find a character that is not symbol-y. Numbers are kinda the same. Here’s the result:

 fun Scanner<Char>.parse(): Token? =
     when (peek()) {
         null -> null
         '(' -> {
             get()
             Token.LPAREN
         }

         ')' -> {
             get()
             Token.RPAREN
         }

         '"' -> Token.STRING(consumeString())

         ' ' -> {
             get()
             Token.SPACE
         }

         '-', in '0'..'9' -> Token.NUM.new(consumeNumber())

         in 'a'..'z', in 'A'..'Z' -> Token.SYMBOL(consumeSymbol())

         else -> error("Unexpected character: ${peek()}")
     }

(Yes, this is the first time we see Token.SPACE. We did the needful in the Token interface to make it happen.)
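We haven’t actually shown the Token type in this installment; a sealed-interface sketch consistent with the code above might look like this — the real definition in the series may well differ:

```kotlin
// Hypothetical sketch of the Token type; the series' actual definition may
// differ. This shape just makes the when-branches above type-check.
sealed interface Token {
    object LPAREN : Token
    object RPAREN : Token
    object SPACE : Token
    data class STRING(val value: String) : Token
    data class SYMBOL(val name: String) : Token
    data class NUM(val value: Double) : Token {
        companion object {
            // Mirrors the Token.NUM.new(...) factory call seen above.
            fun new(text: String): NUM = NUM(text.toDouble())
        }
    }
}
```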

The consume() methods are pedestrian. Here’s consumeString() for reference:

fun Scanner<Char>.consumeString(): String {
    var result = ""
    get() // consume starting "
    while (peek() != null && peek() != '"') {
        result += get()!!
    }
    get() // consume ending "
    return result
}

Symbols, for example, do not need to consume any extraneous characters beyond the ones that accumulate into the symbol’s name.
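For completeness, here are sketches of consumeSymbol() and consumeNumber() in the same spirit — assuming the peek()/get() Scanner interface from this series; the exact set of symbol-y characters is my guess:

```kotlin
interface Scanner<T> { fun peek(): T?; fun get(): T? }

// String-backed scanner, scaffolding for the example.
class StringScanner(private val input: String) : Scanner<Char> {
    private var pos = 0
    override fun peek(): Char? = input.getOrNull(pos)
    override fun get(): Char? = input.getOrNull(pos)?.also { pos++ }
}

// Accumulate characters while they look symbol-y; nothing extraneous is consumed.
fun Scanner<Char>.consumeSymbol(): String {
    var result = ""
    while (peek()?.let { it.isLetterOrDigit() || it == '-' } == true) {
        result += get()!!
    }
    return result
}

// Same pattern for numbers: an optional leading minus, then digits and dots.
fun Scanner<Char>.consumeNumber(): String {
    var result = ""
    if (peek() == '-') result += get()!!
    while (peek()?.let { it.isDigit() || it == '.' } == true) {
        result += get()!!
    }
    return result
}
```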

In the next installment, we’ll put the parse() method to use to implement a Scanner<Token> that gets its input from a Scanner<Char>.

Both Iterator<T> and Scanner<T> lend themselves well to stacking, where a scanner is built on top of another, on top of another. You could, for instance, implement a Scanner<Token> on top of a Scanner<Token> that simply ignores Token.SPACE. We’ll use that same idea to parse the stuff inside parentheses (cons lists) into a new kind of Token.
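Here is what such a space-ignoring layer could look like, sketched against a minimal Scanner and Token of my own (the list-backed scanner is just scaffolding for the example):

```kotlin
interface Scanner<T> { fun peek(): T?; fun get(): T? }

sealed interface Token {
    object SPACE : Token
    object LPAREN : Token
    data class SYMBOL(val name: String) : Token
}

// Scaffolding: a scanner over an in-memory list, for the example below.
class ListScanner<T>(private val items: List<T>) : Scanner<T> {
    private var pos = 0
    override fun peek(): T? = items.getOrNull(pos)
    override fun get(): T? = items.getOrNull(pos)?.also { pos++ }
}

// A Scanner<Token> stacked on top of another Scanner<Token>,
// transparently skipping every Token.SPACE it encounters.
class SpaceSkippingScanner(private val inner: Scanner<Token>) : Scanner<Token> {
    private fun skipSpaces() {
        while (inner.peek() == Token.SPACE) inner.get()
    }
    override fun peek(): Token? { skipSpaces(); return inner.peek() }
    override fun get(): Token? { skipSpaces(); return inner.get() }
}
```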

Computer Science says (and if it does not, it should) that you can take any stack of Scanner<T> and collapse it into a single implementation, but doing so does not buy you any improvement in algorithmic efficiency, since Scanner<T> simply pulls the data as it is needed, across any of the layers involved.

The upside is simplicity, as each layer can be committed to a single responsibility. In our case we’ll have these:

  • Convert Char to simple lexical tokens.
  • Glob up cons lists into a more complex cons list structure.
  • Convert tokens into data structures representing a symbol library.

The funny part is that, at the top of this layer cake, our scanner will return a single item: a symbol library.

Let’s marry parse() and Scanner<Char> into a Scanner<T>
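One possible shape for that marriage, sketched ahead of time under my assumed Scanner interface — the next installment may do it differently:

```kotlin
interface Scanner<T> { fun peek(): T?; fun get(): T? }

sealed interface Token {
    object LPAREN : Token
    object RPAREN : Token
}

// String-backed Scanner<Char>, scaffolding for the example.
class StringScanner(private val input: String) : Scanner<Char> {
    private var pos = 0
    override fun peek(): Char? = input.getOrNull(pos)
    override fun get(): Char? = input.getOrNull(pos)?.also { pos++ }
}

// Hypothetical sketch: a Scanner<Token> that pulls characters from a
// Scanner<Char> on demand, keeping a one-token lookahead for peek().
class TokenScanner(
    private val chars: Scanner<Char>,
    private val parse: Scanner<Char>.() -> Token?,
) : Scanner<Token> {
    private var lookahead: Token? = null
    override fun peek(): Token? {
        if (lookahead == null) lookahead = chars.parse()
        return lookahead
    }
    override fun get(): Token? = peek().also { lookahead = null }
}
```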