A lexer is the part of the interpreter responsible for dividing a program's source into valid tokens according to the rules of the language's grammar. The stream of tokens from the lexer then goes into the parser. For instance, a decimal number literal like “10” and a string literal like “hello world” are valid tokens in our language's grammar, so a statement such as ‘return 10’ yields two tokens: a return keyword token and a number literal token for 10. The order of the tokens the lexer emits carries the syntactic structure that the parser validates and then uses to construct a parse tree.
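As a concrete illustration, here is roughly what that looks like in code, using the Token and TokenType types introduced later in this post and a hypothetical Lexer::new constructor (sketched further down); the Whitespace token our lexer also emits for the space is filtered out.
#[test]
fn lexes_return_statement() {
    // Keep only the interesting tokens; our lexer also emits a Whitespace
    // token for the space between "return" and "10".
    let tokens: Vec<Token> = Lexer::new("return 10")
        .filter(|token| token.token_type != TokenType::Whitespace)
        .collect();

    assert_eq!(tokens[0].token_type, TokenType::Return);
    assert_eq!(tokens[1].token_type, TokenType::Number);
}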
Implementing a lexer for our Sox language's grammar is a simple task. Our lexer needs to be able to:
Maintain a reference to the input source it is tokenizing.
Keep track of its progress in tokenizing the input string.
Look ahead n characters depending on the token it is processing. For instance, in our grammar, two-character tokens such as ==, <= and != require the tokenizer to look ahead one character.
Raise errors when and where appropriate.
The above requirements imply that our lexer needs to maintain some state, so we implement our lexer as a Rust struct, as shown below.
pub struct Lexer<'source> {
source: &'source str,
start: usize,
current: usize,
line: usize,
}
The source attribute is the source code of the program the lexer is processing. It is a borrowed attribute; there is no need for the lexer to own the source, because the same lexer type can be pointed at many different sources. The start and current fields track the lexer's progress through the input: start marks where the token currently being scanned begins and current marks the character the lexer is looking at, while line records the current line number for error reporting.
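A constructor is not shown here, but a minimal new associated function would simply borrow the source and reset the bookkeeping fields; the exact signature below is an assumption, not the code in the repository.
impl<'source> Lexer<'source> {
    /// Hypothetical constructor: borrow the source and start scanning
    /// from the first character on line 1.
    pub fn new(source: &'source str) -> Self {
        Lexer {
            source,
            start: 0,
            current: 0,
            line: 1,
        }
    }
}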
We are not required to, but we want to model our lexer as a stream of tokens, so we implement the Iterator trait for it. As shown below, the Iterator trait defines a single required method, next, that an implementing type must provide.
pub trait Iterator {
type Item;
// Required method
fn next(&mut self) -> Option<Self::Item>;
}
The logic that drives the generation of tokens goes into this next method, as shown in listing 1.0.
impl<'source> Iterator for Lexer<'source> {
type Item = Token;
fn next(&mut self) -> Option<Self::Item> {
if !self.is_at_end() {
self.start = self.current;
let character = self.advance();
let token = if let Some(character) = character {
match character {
'(' => {
Some(self.yield_token(LeftParen))
}
')' => {
Some(self.yield_token(RightParen))
}
'{' => {
Some(self.yield_token(LeftBrace))
}
'}' => {
Some(self.yield_token(RightBrace))
}
',' => {
Some(self.yield_token(Comma))
}
'.' => {
Some(self.yield_token(Dot))
}
'-' => {
Some(self.yield_token(Minus))
}
'+' => {
Some(self.yield_token(Plus))
}
';' => {
Some(self.yield_token(Semi))
}
':' => {
Some(self.yield_token(Colon))
}
'%' => Some(self.yield_token(Mod)),
'*' => {
Some(self.yield_token(Star))
}
'!' => {
let token = if self.char_matches('=') {
BangEqual
} else {
Bang
};
Some(self.yield_token(token))
}
'=' => {
let token = if self.char_matches('=') {
EqualEqual
} else {
Equal
};
Some(self.yield_token(token))
}
'<' => {
let token = if self.char_matches('=') {
LessEqual
} else {
Less
};
Some(self.yield_token(token))
}
'>' => {
let token = if self.char_matches('=') {
GreaterEqual
} else {
Greater
};
Some(self.yield_token(token))
}
'/' => {
if self.char_matches('/') {
let comment_value = self.take_while(|ch| ch != '\n');
match comment_value{
Some((comment, _))=> {
Some(Token::new(TokenType::Comment, comment.to_string(), Literal::String(comment.to_string()), self.line))
}
None => {
Some(Token::new(TokenType::Error, "Error fetching comment tokens".into(), Literal::None, self.line))
}
}
} else if self.char_matches('*') {
let mut found_closing_pair = false;
let mut comment_buffer = String::new();
while let (Some(ch), Some(next_ch)) = (self.peek(), self.peek_next()) {
if ch == '*' && next_ch == '/' {
found_closing_pair = true;
break;
} else {
let char = self.advance();
if let Some(ch) = char {
comment_buffer.push(ch);
}
}
}
if !found_closing_pair {
panic!("Found an unclosed comment");
}
self.advance();
self.advance();
Some(Token::new(TokenType::Comment, comment_buffer.clone(), Literal::String(comment_buffer), self.line))
} else {
Some(self.yield_token(Slash))
}
}
'\n' => {
self.line += 1;
Some(self.yield_token(Newline))
}
'"' => {
let sox_string = self.yield_string();
self.token_from_result(sox_string)
}
'A'..='Z' | 'a'..='z' | '_' => {
let ident_val = self.yield_identifier();
self.token_from_result(ident_val)
}
'0'..='9' => {
let number_val = self.yield_number();
self.token_from_result(number_val)
}
' ' => {
Some(self.yield_token(TokenType::Whitespace))
}
_ => {
debug!("Token -{character} - not in allowed set of valid tokens");
Some(Token::new(TokenType::Error, format!("Token -{character} - not in allowed set of valid tokens"), Literal::None, self.line))
}
}
} else {
Some(Token::new(TokenType::Error, "No more characters to lex".into(), Literal::None, self.line))
};
token
} else {
None
}
}
}
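Because Lexer implements Iterator, driving it is just ordinary iterator usage: a for loop, collect, or any other adapter works. The example below assumes the hypothetical Lexer::new constructor sketched earlier.
fn main() {
    let source = "let x = 1 + 2;";
    // Collect every token the lexer produces; iteration stops once
    // next returns None at the end of the input.
    let tokens: Vec<Token> = Lexer::new(source).collect();
    for token in &tokens {
        println!("{:?}", token);
    }
}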
Tokens are also represented using a Rust struct, as shown below.
#[derive(Clone, Debug, PartialEq)]
pub struct Token {
pub token_type: TokenType,
pub lexeme: String,
pub literal: Literal,
pub line: usize,
}
The token_type tells us the kind of a given token; for example, an identifier such as a function name is one kind of token, and a keyword such as and is another. The token types form a closed set, so we use a Rust enum to represent them, as shown below.
#[derive(Copy, Clone, Debug, Eq, PartialEq, PartialOrd, Hash)]
pub enum TokenType {
// Single character tokens
LeftParen,
RightParen,
LeftBrace,
RightBrace,
LeftSqb,
RightSqb,
Colon,
Comma,
Semi,
Plus,
Minus,
Star,
Slash,
Dot,
Mod,
// One or two character tokens
Less,
Greater,
Equal,
EqualEqual,
NotEqual,
LessEqual,
GreaterEqual,
Bang,
BangEqual,
// Literals
Identifier,
Number,
SoxString,
// Keywords
And,
Class,
Else,
False,
For,
If,
Or,
Return,
Super,
True,
While,
Def,
This,
Let,
None,
Print,
Newline,
Whitespace,
Indent,
Dedent,
Comment,
CommentMarker,
Error,
EOF,
}
The lexeme holds the characters we read for a given token; we use the owned String type to represent this. Finally, the literal attribute represents values that evaluate to themselves, such as strings, numbers and booleans. Literals also form a closed set, so we implement the Literal type as an enum, as shown below.
#[derive(Clone, Debug, PartialEq)]
pub enum Literal {
String(String),
Integer(i64),
Float(f64),
Boolean(bool),
None,
}
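Token::new, which the listing calls throughout, is not reproduced in this post; a straightforward version would simply populate the four fields. The sketch below is an assumption about its shape, not the repository's exact code.
impl Token {
    /// Package up the token kind, the raw lexeme text, its literal value
    /// (if any) and the line on which it was found.
    pub fn new(token_type: TokenType, lexeme: String, literal: Literal, line: usize) -> Self {
        Token { token_type, lexeme, literal, line }
    }
}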
The lexer is driven by repeated calls to the next method. Each call reads the next character in the stream and passes it through the large match statement listed above, which greedily attempts to match characters to tokens. Where a character is only the prefix of a multi-character token, the scanner looks ahead in an attempt to find the complete token. The scanner continues in this way until it reaches the end of the input or runs into an error.
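The listing also leans on a handful of small helpers that are not reproduced above: is_at_end, advance, peek, peek_next, char_matches, yield_token and friends. The sketch below shows one plausible way to write a few of them over the start and current indices; the real implementations in the repository may differ.
impl<'source> Lexer<'source> {
    /// True once current has moved past the last byte of the source.
    fn is_at_end(&self) -> bool {
        self.current >= self.source.len()
    }

    /// Look at the current character without consuming it.
    fn peek(&self) -> Option<char> {
        self.source[self.current..].chars().next()
    }

    /// Consume the current character and move current past it.
    fn advance(&mut self) -> Option<char> {
        let ch = self.peek();
        if let Some(c) = ch {
            self.current += c.len_utf8();
        }
        ch
    }

    /// One-character lookahead: consume the next character only if it is
    /// the expected one. This is what turns '=' followed by '=' into EqualEqual.
    fn char_matches(&mut self, expected: char) -> bool {
        match self.peek() {
            Some(ch) if ch == expected => {
                self.current += ch.len_utf8();
                true
            }
            _ => false,
        }
    }

    /// Build a token whose lexeme is the source slice between start and current.
    fn yield_token(&self, token_type: TokenType) -> Token {
        let lexeme = self.source[self.start..self.current].to_string();
        Token::new(token_type, lexeme, Literal::None, self.line)
    }
}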
That is all there is to our lexer. The source code is available at the Sox GitHub repository under the tag 'sox-lexer'. To manually test the lexer, there is a main.sox module under the resources folder of the repository which contains some Sox code. We have also created a cargo run alias, run_sox; to test the lexer, execute cargo run_sox in the project directory.
An implementation detail
In our implementation, the lexer struct borrows the source code, but the tokens and literals hold copies of parts of the source. This choice was deliberate, as it models the problem correctly: the lexer operates on a text stream owned by other objects but produces tokens that should own all of their data. This design choice also significantly simplifies our code further down the line. An alternative implementation keeps all string data in the token and literal as references into the original input string, eliminating string allocations entirely and giving a zero-copy implementation, roughly along the lines sketched below.
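The following sketch shows what such a zero-copy token might look like, with hypothetical names; it is an illustration of the alternative, not the code we benchmarked.
// Zero-copy alternative: lexeme and string literals borrow directly from
// the original source text instead of allocating owned Strings.
#[derive(Clone, Debug, PartialEq)]
pub struct BorrowedToken<'source> {
    pub token_type: TokenType,
    pub lexeme: &'source str,
    pub literal: BorrowedLiteral<'source>,
    pub line: usize,
}

#[derive(Clone, Debug, PartialEq)]
pub enum BorrowedLiteral<'source> {
    String(&'source str),
    Integer(i64),
    Float(f64),
    Boolean(bool),
    None,
}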
We benchmarked both implementations using the criterion crate and noted an almost 30% improvement in speed when using the zero-copy option.
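For context, a criterion benchmark for this kind of comparison can be very small. The sketch below times only the owning lexer; the benchmark name and sample input are invented for illustration, and the zero-copy variant would be benchmarked the same way.
use criterion::{criterion_group, criterion_main, Criterion};

fn lexer_benchmark(c: &mut Criterion) {
    // A small Sox program repeated many times to give the lexer real work.
    let source = "let x = 1 + 2;\nprint x;\n".repeat(1_000);

    c.bench_function("owning lexer", |b| {
        // Count the tokens so the whole input is lexed on every iteration.
        b.iter(|| Lexer::new(&source).count())
    });
}

criterion_group!(benches, lexer_benchmark);
criterion_main!(benches);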