Compiling to bytecode
In our last post, we looked at the basics of the Sox virtual machine and how it runs the bytecode representation of our source code. Now, we will focus on the compiler, which turns source code into the bytecode that our virtual machine executes. Throughout this post, the term compiler refers to the combined parser and bytecode emitter of our Sox bytecode interpreter. We will start by explaining the general ideas behind a Pratt parser and then implement the parsing logic for a single Sox expression. In subsequent posts, we shall expand our compiler to support the complete Sox grammar.
The ideas behind the Pratt parser are as follows -
Each token has an associated parse rule that determines how it should be parsed based on context.
Each parse rule comprises an optional prefix rule, an optional infix rule and a precedence.
The prefix or infix rules are just functions that implement parse logic for a token depending on its context. A prefix rule applies when we have standalone tokens or tokens that take operands only to their right. For example, a ! token only operates on values to its right, while a number token such as 1 is a standalone token; these tokens require a prefix rule. On the other hand, Sox has tokens like the + operator, an infix token that expects operands on its left and right; these must have a defined infix rule to parse them. Finally, we have operators like - that can be either a prefix or an infix operator depending on the context and thus require both prefix and infix rules.
The precedence dictates the order of token compilation, similar to the order of arithmetic operations. Tokens with higher precedence are compiled first.
All this information goes into an array accessed by the token type.
Let's examine some code implementations to understand these ideas more concretely. We start with some compiler scaffolding.
The Compiler
The compiler type is a simple struct, as shown below in Listing 1.0.
pub struct Compiler {
chunk: Option<Chunk>,
previous: Option<Token>,
current: Option<Token>,
lexer: Option<Lexer>,
}
The compiler’s position in the token stream and the bytecode compiled so far are the only state we need to track. This is achieved using the following attributes -
chunk - a Chunk instance that is populated with bytecode as the compiler compiles our source code.
previous - the last token that was produced by the compiler’s lexer from the token stream.
current - the current token referenced by the compiler’s lexer.
lexer - an instance of the Lexer for tokenizing the source code. This is the same Lexer that we used in our tree-walking interpreter.
The entry point to compiling our source code is the compile method shown below in Listing 1.1.
pub fn compile(&mut self, source: &'static str, chunk: Chunk, i: &Interpreter) -> Result<Chunk, ()> {
let lexer = Lexer::new(source);
self.chunk = Some(chunk);
self.lexer = Some(lexer);
self.advance();
self.expression(i);
self.consume(TokenType::EOF, "Expect end of expression.");
Ok(self.chunk.take().unwrap())
}
When we call the compile method, we begin by setting the compiler's attributes using the provided parameters. Next, we invoke the advance method to initiate the tokenization process. Following that, we call the expression method to parse an expression; this method consists of just a call to self.parse_with_precedence(Precedence::Assignment, i). After parsing the expression, we use the take method to transfer ownership of the compiled chunk to the caller, leaving our compiler's chunk attribute as None. Additionally, our compiler includes the advance and consume helper methods, which assist in processing the token stream; you may recognize these from earlier discussions on parsing the grammar for our tree-walking interpreter.
Precedence
The precedence of our language tokens is listed in Listing 2.0. We use an enum with a repr(u8) annotation so that we can cast a precedence into an integer when needed later on. Additionally, we derive the PartialOrd and PartialEq traits to establish an ordered relation on the Precedence type. We also implement the TryFrom trait, which will come in handy later on.
#[repr(u8)]
#[derive(Clone, Copy, PartialOrd, PartialEq, Debug, Hash, Eq)]
pub enum Precedence {
None,
Assignment,
Or,
And,
Equality,
Comparison,
Term,
Factor,
Unary,
Call,
Primary,
}
impl TryFrom<u8> for Precedence {
type Error = ();
fn try_from(value: u8) -> Result<Self, Self::Error> {
match value {
0 => Ok(Self::None),
1 => Ok(Self::Assignment),
2 => Ok(Self::Or),
3 => Ok(Self::And),
4 => Ok(Self::Equality),
5 => Ok(Self::Comparison),
6 => Ok(Self::Term),
7 => Ok(Self::Factor),
8 => Ok(Self::Unary),
9 => Ok(Self::Call),
10 => Ok(Self::Primary),
_ => Err(()),
}
}
}
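To see how the derived ordering and the TryFrom implementation work together, here is a small demonstration using an abbreviated stand-in for the Precedence type (fewer variants, same structure); the one_level_higher helper is a hypothetical name for the "precedence + 1" computation that appears later in the binary rule.

```rust
// Abbreviated stand-in for Precedence: the derive order of the variants
// determines the ordering that PartialOrd gives us.
#[repr(u8)]
#[derive(Clone, Copy, PartialOrd, PartialEq, Debug)]
pub enum Prec {
    None,
    Term,
    Factor,
    Unary,
}

impl TryFrom<u8> for Prec {
    type Error = ();
    fn try_from(value: u8) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(Self::None),
            1 => Ok(Self::Term),
            2 => Ok(Self::Factor),
            3 => Ok(Self::Unary),
            _ => Err(()),
        }
    }
}

// Cast to u8, add one, and convert back: the round trip through TryFrom
// yields the next-higher precedence level, or None at the top.
pub fn one_level_higher(p: Prec) -> Option<Prec> {
    Prec::try_from(p as u8 + 1).ok()
}
```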
Precedence is crucial in the compile process as it determines which groupings of tokens are compiled first. For example, given an expression such as 1+2*3, a number has primary precedence, the + operator has term precedence, and * has factor precedence, which is higher than term precedence, so our compiler should emit bytecode for the * operator before that of the + operator.
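To make this grouping behaviour concrete, here is a toy precedence climber over a token vector: with term precedence below factor precedence, 1+2*3 groups as 1+(2*3). This is an illustrative sketch with made-up Tok, prec, and eval names, not the Sox compiler's actual code path.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tok {
    Num(i64),
    Plus,
    Star,
}

// Numeric precedence levels: Plus is Term (1), Star is Factor (2).
fn prec(t: Tok) -> u8 {
    match t {
        Tok::Plus => 1,
        Tok::Star => 2,
        _ => 0,
    }
}

fn parse_expr(toks: &[Tok], pos: &mut usize, min_prec: u8) -> i64 {
    // Prefix position: only a number can start a (sub)expression here.
    let mut lhs = match toks[*pos] {
        Tok::Num(n) => {
            *pos += 1;
            n
        }
        _ => panic!("expected a number"),
    };
    // Infix loop: keep folding while the next operator binds at least
    // as tightly as min_prec.
    while *pos < toks.len() && prec(toks[*pos]) >= min_prec {
        let op = toks[*pos];
        *pos += 1;
        // Parse the right operand at one level higher, so operators of
        // equal precedence associate to the left.
        let rhs = parse_expr(toks, pos, prec(op) + 1);
        lhs = match op {
            Tok::Plus => lhs + rhs,
            Tok::Star => lhs * rhs,
            _ => unreachable!(),
        };
    }
    lhs
}

fn eval(toks: &[Tok]) -> i64 {
    parse_expr(toks, &mut 0, 1)
}
```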
Parse Rule
We now have all the pieces in place to build our parse rules. Each token possesses a parse rule that drives its compilation. A parse rule is defined as shown below in Listing 3.0.
#[derive(Clone, Copy, Default)]
pub struct ParseRule {
pub infix_fn: Option<fn(&mut Compiler, &Interpreter) -> ()>,
pub prefix_fn: Option<fn(&mut Compiler, &Interpreter) -> ()>,
pub precedence: Precedence,
}
impl ParseRule {
pub const fn const_default() -> Self {
Self {
infix_fn: None,
prefix_fn: None,
precedence: Precedence::None,
}
}
pub const fn new(
prefix_fn: Option<fn(&mut Compiler, &Interpreter) -> ()>,
infix_fn: Option<fn(&mut Compiler, &Interpreter) -> ()>,
precedence: Precedence,
) -> Self {
Self {
infix_fn,
prefix_fn,
precedence,
}
}
}
A ParseRule comprises three attributes -
infix_fn - an optional function for parsing a token when it occurs in an infix context. An example of an infix context is that of the + operator in the expression 1+2.
prefix_fn - an optional function for parsing a token when it occurs in a prefix context. An example of a prefix context is that of the ! operator in the expression !true.
precedence - the parse precedence of the token.
Each parse rule implicitly encodes some rule of our language grammar - for example, a token without a prefix_fn cannot start an expression. The ParseRule for each token goes into an array indexed by the token's usize representation. Listing 4.0 is a cross-section of some entries in this array.
const PARSE_RULES: [ParseRule; 45] = {
let mut data = [ParseRule::const_default(); 45];
...
data[TokenType::Minus as usize] = ParseRule::new(
Some(Compiler::unary as fn(&mut Compiler, &Interpreter) -> ()),
Some(Compiler::binary as fn(&mut Compiler, &Interpreter) -> ()),
Precedence::Term
);
data[TokenType::Plus as usize] = ParseRule::new(
None,
Some(Compiler::binary as fn(&mut Compiler, &Interpreter) -> ()),
Precedence::Term
);
data[TokenType::Star as usize] = ParseRule::new(
None,
Some(Compiler::binary as fn(&mut Compiler, &Interpreter) -> ()),
Precedence::Factor
);
data[TokenType::Slash as usize] = ParseRule::new(
None,
Some(Compiler::binary as fn(&mut Compiler, &Interpreter) -> ()),
Precedence::Factor
);
data[TokenType::Dot as usize] = ParseRule::new(None, None, Precedence::None);
data[TokenType::Rem as usize] = ParseRule::new(None, None, Precedence::None);
data[TokenType::Bang as usize] = ParseRule::new(
Some(Compiler::unary as fn(&mut Compiler, &Interpreter) -> ()),
None,
Precedence::None
);
data[TokenType::BangEqual as usize] = ParseRule::new(
None,
Some(Compiler::binary as fn(&mut Compiler, &Interpreter) -> ()),
Precedence::Equality
);
data[TokenType::Equal as usize] = ParseRule::new(None, None, Precedence::None);
...
data
};
We use an array in this context because we want to initialize the table as a constant at compile time. This means that heap-allocated data structures like hashmaps and vectors do not work here. Additionally, we have defined the constant functions new and const_default, which are helpers for populating our array with parse rules at compile time.
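The table-building pattern can be seen in isolation in the reduced sketch below: a const fn default plus a fixed-size array filled in inside a const initializer block, indexed by the token's discriminant. The TokType and Rule names here are toy stand-ins, and the function pointers are replaced with booleans to keep the sketch short.

```rust
#[derive(Clone, Copy)]
pub enum TokType {
    Plus,
    Star,
    Bang,
}

#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Rule {
    pub has_prefix: bool,
    pub has_infix: bool,
    pub precedence: u8,
}

impl Rule {
    // A const fn lets us build the array's default entries at compile time.
    pub const fn const_default() -> Self {
        Self {
            has_prefix: false,
            has_infix: false,
            precedence: 0,
        }
    }
}

// The whole table is evaluated at compile time: start from defaults,
// then overwrite the entries we care about, indexed by discriminant.
pub const RULES: [Rule; 3] = {
    let mut data = [Rule::const_default(); 3];
    data[TokType::Plus as usize] = Rule { has_prefix: false, has_infix: true, precedence: 1 };
    data[TokType::Star as usize] = Rule { has_prefix: false, has_infix: true, precedence: 2 };
    data[TokType::Bang as usize] = Rule { has_prefix: true, has_infix: false, precedence: 0 };
    data
};

pub fn get_rule(t: TokType) -> Rule {
    RULES[t as usize]
}
```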
Parsing with Precedence
Every token is parsed using the parse_with_precedence
method shown in Listing 5.0 below.
pub fn parse_with_precedence(&mut self, precedence: Precedence, i: &Interpreter) -> Result<(), ()>{
self.advance();
let previous_token_type = self.previous.as_ref().map(|token| token.token_type);
if let Some(previous_token_type) = previous_token_type {
let prefix_rule = self.get_rule(previous_token_type).prefix_fn;
if let Some(prefix_rule_fn) = prefix_rule {
prefix_rule_fn(self, i);
} else {
return Err(());
}
} else {
return Err(());
}
while self.current.is_some() && precedence <= self.get_rule(self.current.as_ref().unwrap().token_type).precedence {
self.advance();
let infix_rule = self.get_rule(self.previous.as_ref().unwrap().token_type).infix_fn;
if let Some(infix_rule_fn) = infix_rule {
infix_rule_fn(self, i)
}
}
Ok(())
}
Here is how this works for the simple expression 2 + 3 * 4. The first token is the number 2, which has primary precedence, the highest parse precedence, so the parser looks up the rule for parsing a number and invokes the method referenced by the prefix_fn field of this rule; this method, number, is shown below. Its task is to emit bytecode for the number 2. Now imagine that we have an expression like * 2 + 3. The rule for the first token, *, does not define a prefix_fn, so it cannot start an expression; hence, the parser will fail.
pub fn number(&mut self, i: &Interpreter) {
let val = i64::from_str(self.previous.as_ref().unwrap().lexeme).unwrap();
let obj_payload = SoxInt { value: val };
let obj = SoxRef::new_ref(obj_payload, i.types.int_type.to_owned());
self.emit_constant(SoxObjectRef::from(obj));
}
The parser then proceeds to the next token, +.
Here, we are in the middle of an expression, so the rule for this operator must have an infix_fn, or parsing fails. In this case, the infix_fn is the binary method shown below.
pub fn binary(&mut self, i: &Interpreter) {
if let Some(token) = self.previous.as_ref() {
let operator_type = token.token_type;
let rule = self.get_rule(operator_type);
let new_precedence = Precedence::try_from(rule.precedence as u8 + 1).unwrap();
self.parse_with_precedence(new_precedence, i);
match operator_type {
TokenType::Plus => {
self.emit_byte((OpCode::OpAdd, None));
}
TokenType::Minus => {
self.emit_byte((OpCode::OpSubtract, None));
}
TokenType::Star => {
self.emit_byte((OpCode::OpMultiply, None));
}
TokenType::Slash => {
self.emit_byte((OpCode::OpDivide, None));
}
_ => {
return;
}
}
}
}
Given that we are at an infix operator, we must have already parsed the left operand, leaving the right one. The binary method calls parse_with_precedence again to parse the right operand, passing a precedence of Precedence::try_from(rule.precedence as u8 + 1).unwrap(). Using one level higher than the operator's own precedence makes the operator left-associative. This call parses the subexpression 3 * 4, which becomes the right operand to the + operator as expected.
Our expression could be composed of multiple prefix tokens, such as !!true or --2. Recursive calls in our prefix functions handle this. The unary prefix function below illustrates this.
pub fn unary(&mut self, i: &Interpreter) {
let operator_type = self.previous.as_ref().unwrap().token_type;
self.parse_with_precedence(Precedence::Unary, i)
.expect("failed to parse unary operand");
match operator_type {
TokenType::Bang => {
self.emit_byte((OpCode::OpNot, None));
return;
}
TokenType::Minus => {
self.emit_byte((OpCode::OpNegate, None));
return;
}
_ => {
return;
}
}
}
Within the rule, we recursively call parse_with_precedence with a Precedence::Unary precedence value to handle nested subexpressions.
Emitting bytecode
The final step of compiling expressions is to emit bytecode. If you paid close attention to some of the snippets above, you would have noticed the emit_byte calls at the end of some methods. The emit_byte method, illustrated below, is responsible for emitting bytecode.
pub fn emit_byte(&mut self, data: (OpCode, Option<u8>)) {
let opcode = data.0;
let operand = data.1;
self.chunk
.as_mut()
.unwrap()
.write_chunk(opcode as u8, self.previous.as_ref().unwrap().line);
if let Some(operand) = operand {
self.chunk
.as_mut()
.unwrap()
.write_chunk(operand, self.previous.as_ref().unwrap().line);
}
}
It takes a tuple of an opcode and an optional operand as an argument and writes those values to the chunk. The write process is just an append to the chunk's code and line vectors.
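A minimal Chunk along these lines is sketched below: write_chunk is a plain append to parallel code and lines vectors. The real Sox Chunk carries more state (for instance, a constants table), and emit_byte is written here as a free function over raw u8 opcodes purely so the append behaviour can be shown in isolation.

```rust
#[derive(Default)]
pub struct Chunk {
    // Parallel vectors: code[i] was emitted for source line lines[i].
    pub code: Vec<u8>,
    pub lines: Vec<usize>,
}

impl Chunk {
    // Appending a byte records it alongside the source line it came from.
    pub fn write_chunk(&mut self, byte: u8, line: usize) {
        self.code.push(byte);
        self.lines.push(line);
    }
}

// Write the opcode and, if present, its operand, tagging both with the
// same source line.
pub fn emit_byte(chunk: &mut Chunk, data: (u8, Option<u8>), line: usize) {
    chunk.write_chunk(data.0, line);
    if let Some(operand) = data.1 {
        chunk.write_chunk(operand, line);
    }
}
```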
That is all there is to compiling expressions with our Pratt-style parser. We are now able to execute simple expressions using our bytecode interpreter. In subsequent posts, we will look into how to compile the other tokens in our language.
All the code for this is available on the bytecode_vm branch of our GitHub repo. You will find most of what we have discussed here in the vm mod of the source tree.