Building an interpreter: Basics of the Virtual Machine

it is very simple but not easy...

Feb 06, 2025

The Sox virtual machine executes bytecode. Bytecode here is an array of opcode bytes and in some cases their operands - for example, an opcode to load a constant value will take an index into the location of that value. The compiler is responsible for emitting bytecode from the stream of source code tokens output by the lexer. The virtual machine then interprets bytecode in a massive for loop that switches on each opcode type. Figure 1.0 shows the process from source code to execution.

Figure 1.0: Executing code with the Sox virtual machine.

For now, we will make use of the same lexer from our tree walker interpreter because all we care about is that the right tokens are emitted. The compiler and virtual machine need to be implemented. The virtual machine is very simple to implement, while the compiler is driven by one big idea which once understood, also makes it very simple to implement.

Representing bytecode

Bytecode is just a chunk of bytes where each byte represents an operation code or OpCode. The opcodes are instructions for our virtual machine. They are predominantly simple operations such as adding and subtracting two numbers, loading constants for further operations, etc. Listing 1.0 below is a subset of the opcodes that the Sox virtual machine implements. This gives an idea of the nature of opcodes. As we add functionality to the virtual machine, this collection of opcodes will grow.

#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u8)]
pub enum OpCode {
    OpConstant,
    OpNone,
    OpTrue,
    OpFalse,
    OpEqual,
    OpGreater,
    OpLess,
    OpAdd,
    OpSubtract,
    OpMultiply,
    OpDivide,
    OpNot,
    OpNegate,
    OpReturn,
}

impl TryFrom<u8> for OpCode {
    type Error = ();

    fn try_from(value: u8) -> Result<Self, Self::Error> {
        // Static mapping array for u8 to OpCode
        const OPCODE_MAP: [Option<OpCode>; 14] = [
            Some(OpCode::OpConstant),
            Some(OpCode::OpNone),
            Some(OpCode::OpTrue),
            Some(OpCode::OpFalse),
            Some(OpCode::OpEqual),
            Some(OpCode::OpGreater),
            Some(OpCode::OpLess), 
            Some(OpCode::OpAdd),
            Some(OpCode::OpSubtract),
            Some(OpCode::OpMultiply),
            Some(OpCode::OpDivide),
            Some(OpCode::OpNot),
            Some(OpCode::OpNegate),
            Some(OpCode::OpReturn),
        ];

        OPCODE_MAP
            .get(value as usize) 
            .and_then(|&opcode| opcode) 
            .ok_or(()) 
    }
}

impl TryFrom<OpCode> for u8 {
    type Error = ();

    fn try_from(op_code: OpCode) -> Result<Self, Self::Error> {
        match op_code {
            OpCode::OpConstant => Ok(0),
            OpCode::OpNone => Ok(1),
            OpCode::OpTrue => Ok(2),
            OpCode::OpFalse => Ok(3),
            OpCode::OpEqual => Ok(4),
            OpCode::OpGreater => Ok(5),
            OpCode::OpLess => Ok(6),
            OpCode::OpAdd => Ok(7),
            OpCode::OpSubtract => Ok(8),
            OpCode::OpMultiply => Ok(9),
            OpCode::OpDivide => Ok(10),
            OpCode::OpNot => Ok(11),
            OpCode::OpNegate => Ok(12),
            OpCode::OpReturn => Ok(13),
        }
    }
}

We want bytecode that is as compact as possible so we use 1 byte per opcode. The repr(u8) annotation here instructs the Rust compiler to store the opcode enum using 1 byte of memory. Without this annotation, the compiler uses whatever layout is optimal for the enum at compile time. This annotation also means we can cast the Opcode enum to bytes and vice versa without loss of information; this is what happens with our bytecode. We want to be able to move back and forth between opcodes and bytes so we define TryFrom<u8> for OpCode and TryFrom<OpCode> for u8 traits for Opcodes to facilitate this. An alternative approach to interpret an Opcode as a byte is to use the as u8 cast but we choose to implement the TryFrom<OpCode> . This choice is more a style preference.

The Chunk type defined in Listing 2.0 below represents Sox bytcode.

#[derive(Debug, Clone)]
pub struct Chunk {
    pub code: Vec<u8>,
    pub constants: Vec<SoxObjectRef>,
    lines: Vec<usize>,
}

The code attribute is a vector of opcodes from Listing 1.0. The lines attribute is a vector of line numbers - the opcode at index, i, within the code vector has a line number at index, i, within the lines vector. The constants attribute is a vector of constant values within the source code. For instance, the chunk for the expression - 1+2, will have the 1 and 2 constants in the constants vector.

Listing 3.0 below is the implementation of helper methods for the Chunk type. These provide method for creating a new Chunk, writing to a Chunk, handling constants and disassembling Chunk code.

impl Default for Chunk {
    fn default() -> Self {
        Self::new()
    }
}

impl Chunk {
    pub fn new() -> Chunk {
        Chunk { code: Vec::new(), constants: vec![], lines: vec![] }
    }
    
    pub fn write_chunk(&mut self, data: u8, line: usize) {
        self.code.push(data);
        self.lines.push(line);
    }
    
    pub fn disassemble(&self, name: &str) {
        println!("== {} ==", name);
        let mut offset = 0;
        while offset < self.code.len() {
            offset = self.disassemble_instruction(offset);
            println!("{:?}", offset);
        }
        
    }
    
    pub fn disassemble_instruction(&self, offset: usize) -> usize {
        print!("{:04} ", offset);
        if offset > 0 && self.lines[offset] == self.lines[offset - 1] {
            print!("   | ");
        } else {
            print!("{:4} ", self.lines[offset]);
        }
        let opcode = self.code[offset].try_into().unwrap();
        
        match opcode {
            OpCode::OpReturn => { 
                self.simple_instruction("OpReturn", offset)
            }
            OpCode::OpConstant => {
                self.constant_instruction("OpConstant", offset)
            },
            OpCode::OpNegate => {
                self.simple_instruction("OpNegate", offset)
            }
            OpCode::OpAdd => {
                self.simple_instruction("OpAdd", offset)
            }
            OpCode::OpSubtract => {
                self.simple_instruction("OpSubtract", offset)
            }
            OpCode::OpMultiply => {
                self.simple_instruction("OpMultiply", offset)
            }
            OpCode::OpDivide => {
                self.simple_instruction("OpDivide", offset)
            },
            OpCode::OpNot => {
                self.simple_instruction("OpNot", offset)
            }
            OpCode::OpNone => {
                self.simple_instruction("OpNone", offset)
            }
            OpCode::OpTrue =>{
                self.simple_instruction("OpTrue", offset)
            } 
            OpCode::OpFalse => {
                self.simple_instruction("OpFalse", offset)
            },
            OpCode::OpEqual => {
                self.simple_instruction("OpEqual", offset)
            }
            OpCode::OpGreater => {
                self.simple_instruction("OpGreater", offset)
            }
            OpCode::OpLess => {
                self.simple_instruction("OpLess", offset)
            }
        }
    }
    
    fn simple_instruction(&self, opcode: &str, offset:usize) -> usize {
        println!("{}", opcode);
        offset+1
    }
    
    pub fn add_constant(&mut self, constant: SoxObjectRef) -> usize{
        self.constants.push(constant);
        self.constants.len() - 1
    }
    
    fn constant_instruction(&self, opcode: &str, offset:usize) -> usize {
        let constant_idx = self.code[offset+1] as usize;
        print!("{} ({:?})", opcode, constant_idx);
        println!("{:?}", self.constants[constant_idx]);
        offset+2
    }
}

The above is all there is to bytecode for now.

The VM

The basics of the Sox virtual machine are as simple as shown in the Figure 2.0 below.

At the core of the VM is a loop that reads and executes instructions one at a time. This Sox VM is a stack based VM; these types of VM load instructions into and read from a stack and are much simpler to implement. The VM type is defined as shown in below in Listing 4.0.

pub struct VirtualMachine {
    chunk: Chunk,
    ip: usize,
    stack: Vec<SoxObjectRef>,
    stack_top: usize,
}

The chunk attribute is the bytecode of instructions that the VM executes. This value is set to the output of the compiled source code supplied to the VM.
The ip atttribute is the instruction pointer, an index into the bytecode that tells the VM what the next instruction to read is.
The stack attribute is a stack data structure that is central in executing instructions. For example, the instruction for adding two numbers is something like below.
```
LOAD_CONSTANT 1 
LOAD_CONSTANT 2
ADD
```
These instructions all make use of the stack - the first load_constant instruction pushes 1 on to the stack while the second load_constant instruction pushes 2 onto the stack. The add operation, a binary operation, then causes the VM to pop the two values that have been loaded on the stack, add them together and push the result of the addition back onto the stack.
The stack_top is a convenience attribute that always points to the top of the stack.

Running the VM.

The VM’s run method is the entry point into executing the bytecode. It is here that the VM loops through the bytecode referenced by the chunk and executes the instructions specified by the bytecode. The followng snippet below shows the implemenation of three instructions -

a. OpAdd - this instructs the VM to add the top two values on the stack together and place the result back on the stack.

b. OpConstant - this instructs the VM to load a value from the chunk’s constant array into the VM’s stack.

The run loop is aided by a couple of simple macros that we discuss below.

 pub fn run(&mut self, i : &Interpreter) {
        while self.ip < self.chunk.code.len() {
            self.chunk.disassemble_instruction(self.ip);
            let inst = read_instr!(self);
            match inst {
                OpCode::OpAdd => {
                    binary_op!(self, i, add, number);
                }
                
                OpCode::OpConstant => {
                    let constant = read_constant!(self);
                    push_stack!(self, constant);
                    continue;
                }
                _ => {}
            }
        }
    }

The read_instr! macro below handles reading instructions from the chunk and updating the instruction pointer, ip.

macro_rules! read_instr {
    ($a:expr) => {{
        let instruction = $a.chunk.code[$a.ip];
        let opcode: OpCode = instruction.try_into().unwrap();
        $a.ip += 1;
        opcode
    }};
}

The binary_op! macro handles operations that require two operands as shown below. This is predominantly for arithmethic operations.

macro_rules! binary_op {
    ($vm:expr, $interpreter:expr, $slot_op:ident, $slot_name:ident) => {{
        let b = pop_stack!($vm);
        let a = pop_stack!($vm);
        let operation = a.typ().slots.$slot_name.as_ref().unwrap().$slot_op.unwrap();
        let res = (operation)(a, b, $interpreter).unwrap();
        push_stack!($vm, res);
    }};
}

push_stack! and pop_stack! push values onto the stack and pop values from the stack as shown below.

macro_rules! pop_stack {
    ($a:expr) => {{
        $a.stack_top -= 1;
        let v = $a.stack.pop().unwrap();
        v
    }};
}

macro_rules! push_stack {
    ($a:expr, $b:expr) => {{
        $a.stack.push( $b );
        $a.stack_top += 1;
    }};
}

The read_constant! macro loads a constant from the chunk.

macro_rules! read_constant {
    ($a:expr) => {{
        let instruction = read_instr!($a);
        let constant = $a.chunk.constants[instruction as usize].clone();
        constant
    }};
}

These simple but yet powerful construct described here form the basis for building out the Sox Virtual machine. We will put them together like lego blocks to build out an interpreter that matches the tree-walker interpreter in functionality and along the way we will learn the fundamentals of how interpreted languages like Python actually work under the hood. As usual you can find the complete code for this in the bytecode_vm branch of the Sox github repo. Next, we will look at the compiler that takes our source code and output bytecode that the VM can execute.

Building with Rust

Discussion about this post