Building an Interpreter: Reworking the object system

First make it work, then experiment...

Dec 26, 2024

Before working on the virtual machine interpreter for Sox, I had to re-engineer the Sox object system. Three things motivated this:

Garbage Collection Aspirations: As part of the virtual machine design, we want to incorporate a rudimentary garbage collector. The current object system implementation relies on Rust's ownership and borrow checking for memory management, which prevents us from managing memory explicitly, a necessity for demonstrating GC concepts. We need a system that leaves some garbage for our interpreter runtime to handle.
Exploration and Learning: This whole series is an exploratory use of Rust, so implementing something as involved as the object system using multiple approaches is an excellent way to gain a deep understanding of the language.
Ergonomic Inefficiencies: The current implementation of the object system, built on a large enum (SoxObject), proved cumbersome. Every time we implement new functionality, we have to exhaustively pattern match across all possible enum variants. Type checking also devolved into repetitive if-let expressions that felt very unidiomatic.

A New Object System

The new object system makes judicious use of unsafe Rust's raw pointers and pointer casting. It draws inspiration from projects like RustPython, which successfully employ similar techniques.

The new system must address the following key requirements:

Generic Object Representation: We need a way to represent a generic SoxObject that could hold any of our existing SoxType variants (like integers, floats, strings, etc.).
Concrete (Generic) Object Representation: We must be able to represent concrete types such as strings, ints, floats, etc.
Multiple References: Both concrete and generic object instances must support multiple strong references within our runtime, enabling the sharing and manipulation of such objects.
Seamless Type Conversion: We need efficient and idiomatic methods for converting between the concrete types (SoxInt, SoxFloat, etc.) and the generic SoxObject representation.

Rust enums, trait objects, and pointers are all ways of representing a value that can be one of several types. We have already explored enums. Trait objects, while safe, would make demonstrating garbage collection difficult. This left us with pointers and unsafe Rust.

Key Components of the New Object System

Our new object system is best captured by the image below. On the right, we have the typed types and on the left, we have the untyped types. The SoxRef and SoxObjectRef types implement strong references to the SoxObject and Sox<T> objects.

Figure 1.0: The new types for the object system.

We start by defining a generic struct, SoxObjectInner<T>, the foundation for the new object system as Figure 1.0 shows.

#[repr(C)]
pub struct SoxObjectInner<T> {
    pub type_id: TypeId,
    pub typ: SoxRef<SoxType>,
    pub payload: T,
}

The type contains a type_id field for quick type comparison, a typ field to identify the underlying data type and a payload field of generic type T. The implementation of the SoxObjectInner type below shows the new method for creating pointers to new instances of SoxObjectInner.

impl<T: SoxObjectPayload> SoxObjectInner<T> {
    pub fn new(d: T, typ: SoxRef<SoxType>) -> Box<Self> {
        Box::new(SoxObjectInner {
            type_id: TypeId::of::<T>(),
            typ,
            payload: d,
        })
    }
}

The payload is restricted to types that implement the SoxObjectPayload trait, so in this case only builtin types can go into the payload field. Another important point is that new returns a Box<Self>, that is, a Box<SoxObjectInner<T>>, rather than the bare struct. A Box is a safe, owning pointer type whose allocation is freed automatically by the Rust compiler. We use it because, once we start manipulating pointers, it is easy to get a raw pointer out of a Box with Box::into_raw.
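To make the Box-to-raw-pointer step concrete, here is a minimal sketch of the round trip. The Slot type and both function names are hypothetical stand-ins, not part of Sox. Once Box::into_raw has been called, nothing frees the allocation automatically, which is exactly the kind of garbage a collector would later have to deal with.

```rust
/// Hypothetical stand-in for a heap-allocated object; not part of Sox.
#[derive(Debug, PartialEq)]
pub struct Slot {
    pub value: i64,
}

/// Allocates with `Box`, then gives up compiler-managed ownership.
/// After `Box::into_raw`, the allocation lives until someone reclaims it.
pub fn leak_to_raw(value: i64) -> *mut Slot {
    Box::into_raw(Box::new(Slot { value }))
}

/// Takes ownership back so the allocation is freed normally.
///
/// # Safety
/// `ptr` must come from `leak_to_raw` and must not be used afterwards.
pub unsafe fn reclaim(ptr: *mut Slot) -> Slot {
    // Rebuild the Box and move the value out; the Box frees the allocation.
    unsafe { *Box::from_raw(ptr) }
}
```

A runtime that never calls something like reclaim simply leaks the object, so freeing becomes an explicit decision rather than something the borrow checker decides for us.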

The next level up in Figure 1.0 contains the SoxObject and Sox<T> definitions shown below.

#[repr(transparent)]
pub struct SoxObject(SoxObjectInner<()>);

#[repr(transparent)]
pub struct Sox<T: SoxObjectPayload>(SoxObjectInner<T>);

Both SoxObject and Sox<T> wrap the SoxObjectInner type. SoxObject’s only field is a SoxObjectInner<()>, that is, a SoxObjectInner with the unit type as its payload type. This represents the fact that the type of the underlying payload has been erased when we create a SoxObject; the unit type is, in my opinion, the best type to represent this concept. Sox<T> is a rudimentary generic type that helps prevent code duplication across the concrete Sox types. Both structs are annotated with #[repr(transparent)], which instructs the compiler to give the struct the same layout and ABI as its single field. In essence, the wrapper is laid out in memory as if only the inner field were present.
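The effect of #[repr(transparent)] can be checked directly. The sketch below uses hypothetical Inner and Wrapper types (simplified stand-ins for SoxObjectInner<T> and Sox<T>) to show that the wrapper has the same size and alignment as its field, and that a pointer to one can be reinterpreted as a pointer to the other.

```rust
use std::mem::{align_of, size_of};

/// Simplified stand-in for SoxObjectInner<T>; the `tag` field is an assumption.
#[repr(C)]
pub struct Inner<T> {
    pub tag: u32,
    pub payload: T,
}

/// Simplified stand-in for Sox<T>: a single-field transparent wrapper.
#[repr(transparent)]
pub struct Wrapper<T>(pub Inner<T>);

/// Reports whether the wrapper and its field agree on size and alignment.
pub fn same_layout<T>() -> bool {
    size_of::<Wrapper<T>>() == size_of::<Inner<T>>()
        && align_of::<Wrapper<T>>() == align_of::<Inner<T>>()
}

/// Reinterprets a boxed `Inner<T>` as a boxed `Wrapper<T>` without copying.
pub fn wrap_ptr<T>(inner: Box<Inner<T>>) -> Box<Wrapper<T>> {
    let raw = Box::into_raw(inner);
    // Sound because Wrapper<T> is repr(transparent) over Inner<T>,
    // so both types describe the same bytes with the same layout.
    unsafe { Box::from_raw(raw.cast::<Wrapper<T>>()) }
}
```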

At the last level are the SoxObjectRef and SoxRef<T> types defined below that provide multiple strong references to SoxObject and Sox<T> types respectively.

#[derive(Debug, Copy)]
#[repr(transparent)]
pub struct SoxObjectRef {
    pub ptr: NonNull<SoxObject>,
}

#[repr(transparent)]
pub struct SoxRef<T: SoxObjectPayload> {
    pub ptr: NonNull<Sox<T>>,
}

This is where it starts to get interesting because we introduce NonNull, a wrapper around a raw pointer that is guaranteed to be non-null. Both reference types are just pointers to the underlying heap-allocated objects. Again, take note of the #[repr(transparent)] annotations on the structs. To create a new reference, all we have to do is clone the reference object.
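As a rough illustration of what a NonNull-based reference looks like in isolation, the sketch below defines a hypothetical Handle type (not the real SoxRef) around a NonNull<i64>. Making another reference only copies the pointer, so both handles observe the same object. The allocation is deliberately never freed here.

```rust
use std::mem::size_of;
use std::ptr::NonNull;

/// Hypothetical handle type; a much-simplified stand-in for SoxRef<T>.
pub struct Handle {
    ptr: NonNull<i64>,
}

impl Handle {
    /// Allocates the value on the heap and keeps only a raw, non-null pointer.
    pub fn new(value: i64) -> Self {
        let raw = Box::into_raw(Box::new(value));
        // Box::into_raw never returns null, so new_unchecked is fine here.
        Handle { ptr: unsafe { NonNull::new_unchecked(raw) } }
    }

    /// Creates a second handle to the same allocation by copying the pointer.
    pub fn alias(&self) -> Self {
        Handle { ptr: self.ptr }
    }

    /// Reads the current value through the raw pointer.
    pub fn get(&self) -> i64 {
        unsafe { *self.ptr.as_ptr() }
    }

    /// Writes a new value through the raw pointer.
    pub fn set(&self, value: i64) {
        unsafe { *self.ptr.as_ptr() = value; }
    }
}

/// NonNull leaves the null bit pattern free, so Option<NonNull<T>> stays pointer-sized.
pub fn option_is_pointer_sized() -> bool {
    size_of::<Option<NonNull<i64>>>() == size_of::<*mut i64>()
}
```

Nothing in this sketch tracks how many handles exist, which is the point: deciding when the object can be freed is left to the runtime.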

The types defined above together with the existing types are all that are required for the new object system. However, to complete this object system, we need an API that makes converting from one type to another as easy and error-free as possible and the next section looks at some of the APIs that have been defined.

Other APIs

Creating a SoxRef<T>

Given a payload of type T, we want to be able to create a SoxRef<T>, which is what most of the interpreter methods need. The implementation of the API for this is in the snippet below.

pub fn new_ref(payload: T, typ: SoxRef<SoxType>) -> SoxRef<T> {
    let inner = Box::into_raw(SoxObjectInner::new(payload, typ));
    Self {
        ptr: unsafe { NonNull::new_unchecked(inner.cast::<Sox<T>>()) },
    }
}

Most of the snippet is self-explanatory. NonNull::new_unchecked is acceptable here because Box::into_raw never returns a null pointer. The cast from SoxObjectInner<T> to Sox<T> is sound because, if you recall, a Sox<T> is defined as pub struct Sox<T: SoxObjectPayload>(SoxObjectInner<T>); with the #[repr(transparent)] annotation. To the compiler, a Sox<T> has the same layout as the inner SoxObjectInner<T>, so the cast and any subsequent read through it are valid.

Creating a SoxObjectRef from a SoxRef<T>

Most methods of the interpreter type operate on SoxObject instances via SoxObjectRef. Therefore, any SoxRef<T> must be converted to a SoxObjectRef type. This conversion is achieved through a pointer cast as the following snippet shows.

impl<T: SoxObjectPayload> From<SoxRef<T>> for SoxObjectRef {
    fn from(value: SoxRef<T>) -> Self {
        Self { ptr: value.ptr.cast() }
    }
}

In the method above, .cast() converts a NonNull<Sox<T>> into a NonNull<SoxObject>, the type of the ptr field in SoxObjectRef. Because both wrappers are #[repr(transparent)], this is in effect a cast from *mut SoxObjectInner<T> to *mut SoxObjectInner<()>. The cast itself is always permitted in Rust, since casting between raw pointers to sized types is a safe operation; it is dereferencing the result that requires care. Issues could therefore arise when retrieving the payload from a SoxObjectRef, and to address this we provide a payload method for safely accessing the payload. The SoxObjectInner struct is annotated with #[repr(C)], which ensures that the struct's fields are laid out in memory in the order they are declared, preventing the compiler from reordering them. This is crucial because, even though the payload's type has been erased at the SoxObjectRef level, we retain the necessary type information in the type_id field, and we can always read it correctly because type_id sits at the same offset from the start of the struct regardless of T.
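The claim about offsets can be verified with a small experiment. The sketch below uses a hypothetical Inner<T> in which a usize stands in for the typ field (an assumption made to keep the example self-contained). It measures where type_id lives and then reads it through a pointer whose payload type has been erased to ().

```rust
use std::any::TypeId;
use std::ptr::addr_of;

/// Simplified stand-in for SoxObjectInner<T>; `typ` is a plain usize here.
#[repr(C)]
pub struct Inner<T> {
    pub type_id: TypeId,
    pub typ: usize,
    pub payload: T,
}

impl<T: 'static> Inner<T> {
    pub fn new(payload: T) -> Self {
        Inner { type_id: TypeId::of::<T>(), typ: 0, payload }
    }
}

/// Byte offset of `type_id` from the start of the struct.
pub fn type_id_offset<T>(inner: &Inner<T>) -> usize {
    let base = inner as *const Inner<T> as usize;
    let field = addr_of!(inner.type_id) as usize;
    field - base
}

/// Reads `type_id` after erasing the payload type to `()`.
pub fn erased_type_id<T>(inner: &Inner<T>) -> TypeId {
    let erased = inner as *const Inner<T> as *const Inner<()>;
    // Sound because repr(C) places type_id first for every T, so the
    // header of Inner<()> overlaps the header of Inner<T> exactly.
    unsafe { (*erased).type_id }
}
```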

The definition of the payload method and its supporting functions are given below.

pub fn payload<T: SoxObjectPayload>(&self) -> Option<&T> {
    if self.payload_is::<T>() {
        Some(unsafe { self.payload_unchecked() })
    } else {
        None
    }
}

pub fn payload_is<T: SoxObjectPayload>(&self) -> bool {
    unsafe { self.ptr.as_ref().0.type_id == TypeId::of::<T>() }
}

pub unsafe fn payload_unchecked<T: SoxObjectPayload>(&self) -> &T {
    let v = self.ptr.as_ref();
    let inner = unsafe { &*(v as *const SoxObject as *const SoxObjectInner<T>) };
    &inner.payload
}

When accessing a payload (e.g., obj.payload::<SoxInt>()), the payload method first calls payload_is::<T>() to check if the type_id of the stored object matches the TypeId of the requested type T. This type check is performed within an unsafe block because we are dereferencing a raw pointer. If the types match, the payload_unchecked::<T>() method is called within an unsafe block to perform the actual pointer cast and access the payload. This unsafe block is justified by the prior type check, which ensures that the cast is valid. The raw pointer is cast from *const SoxObject to *const SoxObjectInner<T>, and then the payload field of the resulting SoxObjectInner<T> is accessed. This two-step process ensures type safety while working with the type-erased SoxObjectRef.
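The check-then-cast pattern described above can be tried out in a standalone form. The sketch below is a simplified, hypothetical version: ErasedRef plays the role of SoxObjectRef, Inner<T> plays the role of SoxObjectInner<T> without the typ field, and a plain 'static bound replaces SoxObjectPayload. Objects are never freed in this sketch.

```rust
use std::any::TypeId;
use std::ptr::NonNull;

/// Simplified stand-in for SoxObjectInner<T> (no `typ` field).
#[repr(C)]
pub struct Inner<T> {
    pub type_id: TypeId,
    pub payload: T,
}

/// Type-erased handle; a hypothetical simplification of SoxObjectRef.
pub struct ErasedRef {
    ptr: NonNull<Inner<()>>,
}

impl ErasedRef {
    /// Allocates the object and immediately forgets its payload type.
    pub fn new<T: 'static>(payload: T) -> Self {
        let raw = Box::into_raw(Box::new(Inner { type_id: TypeId::of::<T>(), payload }));
        // Box::into_raw never returns null.
        ErasedRef { ptr: unsafe { NonNull::new_unchecked(raw.cast::<Inner<()>>()) } }
    }

    /// Compares the stored TypeId with the requested type.
    pub fn payload_is<T: 'static>(&self) -> bool {
        unsafe { self.ptr.as_ref().type_id == TypeId::of::<T>() }
    }

    /// Returns the payload only if the stored TypeId matches `T`.
    pub fn payload<T: 'static>(&self) -> Option<&T> {
        if self.payload_is::<T>() {
            // The check above proves the allocation really is an Inner<T>.
            let typed = self.ptr.as_ptr().cast::<Inner<T>>();
            Some(unsafe { &(*typed).payload })
        } else {
            None
        }
    }
}
```

Asking for the wrong type yields None rather than undefined behaviour, which is what makes the safe wrapper worth having.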

Creating a Sox<T> from SoxObjectRef

In multiple places within our interpreter, we will have a SoxObjectRef but need a Sox<T>. An example of this is when we call a method such as repr on a SoxObjectRef. In such a case, we need to downcast the SoxObjectRef to the payload's type. This also involves the use of a cast, as the following snippet shows.

pub fn downcast_ref<T: SoxObjectPayload>(&self) -> Option<&Sox<T>> {
    if self.payload_is::<T>() {
        Some(unsafe { self.downcast_unchecked_ref::<T>() })
    } else {
        None
    }
}

pub unsafe fn downcast_unchecked_ref<T: SoxObjectPayload>(&self) -> &Sox<T> {
    &*(self as *const SoxObjectRef as *const SoxRef<T>)
}

The downcast_ref<T> function attempts a safe downcast of a SoxObjectRef to a reference to a Sox<T>. It first uses self.payload_is::<T>() to perform a type check. This function (defined above) determines whether the underlying payload of the SoxObjectRef is of the expected type T. If the type check succeeds, the downcast_unchecked_ref::<T> function is called within an unsafe block to perform the actual downcast. If the type check fails, downcast_ref returns None.

The downcast_unchecked_ref::<T> function performs the following steps:

self as *const SoxObjectRef: Converts the reference &self (of type &SoxObjectRef) to a raw const pointer of type *const SoxObjectRef.
as *const SoxRef<T>: Performs a raw pointer cast from *const SoxObjectRef to *const SoxRef<T>. This is the core downcasting operation. The cast itself is safe; what the compiler cannot statically guarantee is that the pointer may be dereferenced as a SoxRef<T>, which is why the function is marked unsafe. The cast assumes that the memory layouts of SoxObjectRef and SoxRef<T> are compatible, which is ensured by #[repr(transparent)] on both reference types (each is a single NonNull pointer) together with #[repr(C)] and #[repr(transparent)] on the underlying structs.
&*(...): Dereferences the resulting raw pointer *const SoxRef<T> and reborrows it as &SoxRef<T>. Since the function returns &Sox<T>, this presumably relies on the Deref implementation for SoxRef<T> (see Miscellaneous APIs) to coerce the reference to &Sox<T>.

The call to downcast_unchecked_ref is sound because of the prior type check performed by payload_is::<T> in downcast_ref. This check guarantees the raw pointer cast is valid at runtime, preventing undefined behaviour. This pattern ensures type safety at the API level.
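The handle-to-handle cast can be sketched in isolation as well. In the hypothetical version below, ObjRef and TypedRef<T> stand in for SoxObjectRef and SoxRef<T>, and a 'static bound replaces SoxObjectPayload. Unlike the article's function, this sketch returns &TypedRef<T> directly, so it does not depend on a Deref implementation. Objects are never freed here.

```rust
use std::any::TypeId;
use std::ptr::NonNull;

/// Simplified stand-in for SoxObjectInner<T> (no `typ` field).
#[repr(C)]
pub struct Inner<T> {
    pub type_id: TypeId,
    pub payload: T,
}

/// Untyped handle; hypothetical stand-in for SoxObjectRef.
#[repr(transparent)]
pub struct ObjRef {
    ptr: NonNull<Inner<()>>,
}

/// Typed handle; hypothetical stand-in for SoxRef<T>.
#[repr(transparent)]
pub struct TypedRef<T> {
    ptr: NonNull<Inner<T>>,
}

impl<T: 'static> TypedRef<T> {
    pub fn new(payload: T) -> Self {
        let raw = Box::into_raw(Box::new(Inner { type_id: TypeId::of::<T>(), payload }));
        // Box::into_raw never returns null.
        TypedRef { ptr: unsafe { NonNull::new_unchecked(raw) } }
    }

    pub fn get(&self) -> &T {
        unsafe { &self.ptr.as_ref().payload }
    }

    /// Upcast: forget the payload type but keep the same pointer.
    pub fn erase(self) -> ObjRef {
        ObjRef { ptr: self.ptr.cast() }
    }
}

impl ObjRef {
    /// Checked downcast of the handle itself, not just the payload.
    pub fn downcast_ref<T: 'static>(&self) -> Option<&TypedRef<T>> {
        let matches = unsafe { self.ptr.as_ref().type_id == TypeId::of::<T>() };
        if matches {
            // Both handles are repr(transparent) over one thin NonNull pointer,
            // so a &ObjRef can be reinterpreted as a &TypedRef<T>.
            Some(unsafe { &*(self as *const ObjRef as *const TypedRef<T>) })
        } else {
            None
        }
    }
}
```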

Miscellaneous APIs

We also implement several other traits, such as Clone, Deref, and Borrow, that make the whole system more ergonomic. See the bytecode_vm branch on GitHub for a complete implementation.

One last item…

Figure 2.0 is the type hierarchy for a Sox object. The integer's type is SoxInt, and SoxInt's type is SoxType.

Figure 2.0: The types of various sox objects

SoxType has a recursive relationship with itself and we chose to employ another set of unsafe Rust primitives to implement this relationship. The function definition below shows how this is implemented.

pub fn init_type_type() -> SoxRef<SoxType> {
    let type_payload = SoxType {
        base: None,
        methods: Default::default(),
        slots: Default::default(),
        attributes: Default::default(),
        name: Some("SoxType".to_owned()),
    };

    // Allocate uninitialized storage for the object, then view it as the final type.
    let type_type_ptr =
        Box::into_raw(Box::new(MaybeUninit::<SoxObjectInner<SoxType>>::uninit()))
            as *mut SoxObjectInner<SoxType>;
    unsafe {
        // Write each field in place. addr_of_mut! avoids creating a reference
        // to memory that is not yet initialized.
        ptr::write(ptr::addr_of_mut!((*type_type_ptr).type_id), TypeId::of::<SoxType>());
        ptr::write(ptr::addr_of_mut!((*type_type_ptr).payload), type_payload);

        // The typ field of the type object points back at the object itself.
        let type_type = SoxRef::<SoxType>::from_raw(type_type_ptr.cast());
        ptr::write(ptr::addr_of_mut!((*type_type_ptr).typ), type_type.clone());
        type_type
    }
}

First, we create the payload, type_payload, which is a SoxType instance. Then we allocate a SoxObjectInner<SoxType> using MaybeUninit. This means that its fields are uninitialized, so we have to initialize each field of the struct manually, which we do with ptr::write. Finally, the recursive relationship is established when we set typ to type_type, a reference created by casting type_type_ptr, the pointer to our MaybeUninit allocation.
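The same bootstrapping trick can be reduced to a few lines. The sketch below uses a hypothetical TypeNode struct whose typ field must point at the node itself, which is impossible to express with an ordinary struct literal because the address is not known until the allocation exists. It is an illustration of the technique, not the Sox implementation.

```rust
use std::mem::MaybeUninit;
use std::ptr::{self, NonNull};

/// Hypothetical self-describing node: its `typ` points at a TypeNode,
/// and for the root type that TypeNode is the node itself.
pub struct TypeNode {
    pub name: &'static str,
    pub typ: NonNull<TypeNode>,
}

/// Builds the root node in two phases: allocate uninitialized, then fill in.
pub fn init_root_type() -> NonNull<TypeNode> {
    // MaybeUninit<TypeNode> has the same layout as TypeNode, so the cast is fine.
    let raw: *mut TypeNode =
        Box::into_raw(Box::new(MaybeUninit::<TypeNode>::uninit())).cast();
    unsafe {
        // addr_of_mut! produces field pointers without creating references
        // to uninitialized memory.
        ptr::write(ptr::addr_of_mut!((*raw).name), "type");
        // Box::into_raw never returns null.
        let this = NonNull::new_unchecked(raw);
        // Close the loop: the node's type is the node itself.
        ptr::write(ptr::addr_of_mut!((*raw).typ), this);
        this
    }
}
```

Only after every field has been written is it valid to treat the allocation as a fully initialized TypeNode.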

With an object system that I feel is more ergonomic and can easily support the addition of a garbage collector, it is now time to start work on the bytecode virtual machine for our interpreter. As usual, the complete code for this is available on the bytecode_vm branch of the Sox language GitHub repo.
