Parsing S-Expressions with Kotlin: Char to Token (II)

alt

This is of a series on parsing with Kotlin which starts here

Let’s get into it right away:

class LinearTokenScanner(private val s: Scanner<Char>) : Scanner<Token> {
    var next: Token? = null
    override fun get(): Token? =
        if (next != null) {  // consume next
            val ret = next
            next = null
            ret
        } else {
            s.parse()  // or return parsed value
        }

    override fun peek(): Token? =
        if (next != null) {
            next   // return next
        } else {
            val ret = s.parse()
            next = ret  // or populate next and return it
            ret
        }
 }

Does that look familiar? It sure does look a lot like our character scanner of two installments ago. The only difference is that instead of checking hasNext() and calling next(), we call parse() instead, transforming our character scanner into a token (of the simple, linear variety) scanner.

Parsing cons lists.⌗

Our linear scanner should now be able to read any of the files we are interested in. That is how simple s-expressions are. What we get from it is a long stream of tokens, without any structure to them. In contrast, a symbol library is a broad tree structure:

 (kicad_symbol_lib
	(version 20241209)
	(generator "kicad_symbol_editor")
	(generator_version "9.0")
	(symbol "14529" ...)
	...
 )

It contains many symbols per file, and each symbol itself is made of many subparts. How can we get a hold of that structure? Well, we are into the business of scanners, so the answer is going to be a scanner, and we have seen that all scanners have the same structure. The other thing that we know is that scanners have a type they input, and a type they output. In this case we just built a scanner that puts out tokens, but what will our scanner return?

We do not need anything exotic, just something to keep track of a list of tokens inside the parentheses, so let’s just add a new Token to our collection!

 data class LIST(val tokens: List<Token>) : Token

This is what we now know about the new scanner:

class TokenScanner(private val s: Scanner<Token>) : Scanner<Token> {
    var next: Token? = null

    override fun get(): Token? =
        if (next != null) {
            val ret = next
            next = null
            ret
        } else {
            parse()
        }

    override fun peek(): Token? =
        if (next != null) {
            next
        } else {
            next = parse()
            next
        }

    private fun Scanner<Token>.parse(): Token? = TODO()
}

The only real work we need to do now is to think about parse(), but in the context of processing tokens, instead of characters. Since we already have tokens, we just need to focus on when we see LPAREN vs. when we see something else:

    fun Scanner<Token>.parse(): Token? =
        when (val token = peek()) {
            is Token.LPAREN -> {
                val tokens = mutableListOf<Token>()
                var lookahead = peek()
                while (lookahead != null && lookahead !is Token.RPAREN) {
                    tokens.add(parse() ?: error("WAAAA?"))
                    lookahead = peek()
                }
                Token.LIST(tokens.filter { it != Token.SP })
            }

            else -> get()
            
        }

The only tricky bit here is that we are calling parse() inside of parse().¹ Why do we do that? Because a list can be inside another list, so while we are nomming up things between parentheses to stuff the inside of Token.LIST, we make sure we nom the lists inside (and the lists inside the lists, etc.), first.

Also, minor point, we get rid of SP tokens before we shove everything inside LIST, which I am fine with. I mean, our input sizes are small, so who cares if we O(N^2).

Finito, for now.⌗

There you have it: we have folded a boring sequence of tokens into an origami tree of s-expressions, not unlike how proteins curl themselves out of RNA in the nucleus of a cell. In the process we have seen some Kotlin goodness which helped make our code a bit cleaner and pithier, and discovered some elegant symmetries and uncovered some code structures that naturally arise out of the notion of traversing a sequence of things with one lookahead.

On that point, I’d like to leave you with two challenges.

First, the simpler one: Implement Iterator<T> in terms of Scanner<T>. This proves some equivalence between the two.

Second. Implement the following:

class OneScanner<I,O>(
    val inputScanner: Scanner<I>, 
    parser: Scanner<I>.()-> O
): Scanner<O>

and create two instances of it: one that takes a CharScanner is equivalent to LinearTokenScanner, and one that takes a LinearTokenScanner and implements a TokenScanner.

We will return to the challenges at some point in time. Ahead of us in the world of S-expressions is to convert various lists of tokens into proper data structures. That part is super mechanical and boring, so we are going to take a few fun detours before that.

If you have not seen that before, calling yourself inside yourself is called recursion. Anything powerful can destroy whole entire days of your life, and recursion is very powerful.² ↩︎
How powerful? As powerful as a loop. “That’s not very powerful”, you may think, to which I say 1) two loops are all you need to compute anything computable, and 2) it’s the difference between a wizard and a sorcerer. They can both do magic, but recursion is sorcery. ↩︎