a design for simple interners

when writing a programming language in rust you often need some recursive datastructures, e.g. for types and the ast.

in rust recursive types need indirection, which can be achieved in a few different ways

Box everything

the most straightforward implementation is to use a Box and Vec for all recursive components, for example

pub enum Type {
    Function {
        arguments: Vec<Type>,
        ret: Box<Type>,
    },
    Pointer(Box<Type>),
    // ... and any other types you need
}

while this works, it is less than ideal as you can’t easily pattern match on your types and they definitely can’t be Copy or even be cloned cheaply.

index everything

another option is to store some sort of index for your recursive components and having an external storage for the actual types, which looks like this

pub struct TypeStorage {
    types: Vec<Type>,
    slices: Vec<Vec<Type>>,
    //   you could use a hashmap for deduplications here,
    //   not doing that due for brevity. probably needed as you
    //   otherwise end up with hundreds of copies for common types.
    // dedup_tys: HashMap<Type, TypeId>,
}

impl TypeStorage {
    pub fn mk_ty(&mut self, ty: Type) -> TypeId {
        let id = TypeId(self.types.len());
        self.types.push(ty);
        id
    }

    pub fn mk_slice(&mut self, slice: Vec<Type>) -> SliceId {
        let id = SliceId(self.slices.len());
        self.slices.push(slice);
        id
    }

    pub fn ty(&self, id: TypeId) -> Type {
        self.types[id.0]
    }

    pub fn slice(&self, id: SliceId) -> &[Type] {
        &self.slices[id.0]
    }
}

#[derive(Copy, Clone)]
pub struct TypeId(usize);
#[derive(Copy, Clone)]
pub struct SliceId(usize);

#[derive(Copy, Clone)]
pub enum Type {
    Function { arguments: SliceId, ret: TypeId },
    // ...
}

now our types can be copied cheaply, but we still can’t look into them during pattern matching.

intern me senpai

to allow both cheap copies and nice pattern matching, we ideally want something like

pub enum Type<'a> {
    Function {
        arguments: &'a [&'a Type<'a>],
        ret: &'a Type<'a>,
    }
    // the type `()` :shrug:
    Unit,
    // ...
}

which allows us to look into types during pattern matching, e.g. when looking for functions returning ()

match ty {
    Type::Function {
        arguments: _,
        ret: &Type::Unit,
    } => {} // uwu
    _ => {} // not uwu
}

this does seem nice, doesn’t it. for this we need a way to store new objects without invalidating references to previously stored ones.

this can’t really be done using just the safe subset of rust. luckily a bunch of people already implemented ways to do this.

for this post, i am going to use bumpalo, but any other arena allocator works just as well. to not allocate the same value in our arena multiple times, we’re also going to deduplicate the value before inserting it. this can be done by using a HashSet.

use bumpalo::Bump;
use std::collections::HashSet;
use std::hash::Hash;

pub struct Interner<'a, T> {
    set: HashSet<&'a T>,
    arena: &'a Bump,
}

impl<'a, T> Interner<'a, T> {
    pub fn new(arena: &'a Bump) -> Interner<'a, T> {
        Interner {
            set: HashSet::new(),
            arena,
        }
    }
}

impl<'a, T: Eq + Hash> Interner<'a, T> {
    pub fn intern(&mut self, value: T) -> &'a T {
        if let Some(value) = self.set.get(&value) {
            value
        } else {
            let value = self.arena.alloc(value);
            assert!(self.set.insert(value));
            value
        }
    }
}

we can now use the interner like the TypeStorage in the previous section ✨

perrrrrrf

note that by deduplicating our values before interning, there should only be one distinct object for each value. we should therefore be able to compare objects using only pointer equality. to do this, we have to prevent any objects from being created outside of our interner as comparing these objects would be incorrect. while there are probably a lot of projects doing this, i have first seen it in rustc itself.

a way to do this is by adding a new wrapper type which can only be created when interning a new value. for this we modify our interner as follows

use bumpalo::Bump;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::ops::Deref;
use std::ptr;

// as the `T` field is private, this struct
// can only be constructed in this module.
#[derive(Copy, Clone)]
#[repr(transparent)]
pub struct Uniq<T>(T);
impl<T> Uniq<T> {
    pub fn value(self) -> T {
        self.0
    }
}
impl<T> Deref for Uniq<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.0
    }
}
impl<T> PartialEq for Uniq<T> {
    fn eq(&self, other: &Self) -> bool {
        ptr::eq(self, other) // such fast
    }
}
impl<T> Eq for Uniq<T> {}
impl<T> Hash for Uniq<T> {
    fn hash<H: Hasher>(&self, s: &mut H) {
        (self as *const Uniq<T>).hash(s) // much wow
    }
}

pub struct Interner<'a, T> {
    set: HashSet<&'a T>,
    arena: &'a Bump,
}

impl<'a, T> Interner<'a, T> {
    pub fn new(arena: &'a Bump) -> Interner<'a, T> {
        Interner {
            set: HashSet::new(),
            arena,
        }
    }
}

impl<'a, T: Eq + Hash> Interner<'a, T> {
    pub fn intern(&mut self, value: T) -> &'a Uniq<T> {
        if let Some(value) = self.set.get(&value) {
            // 👻 `Uniq` is `#[repr(transparent)]` so this is sound 👻
            unsafe { std::mem::transmute::<&'a T, &'a Uniq<T>>(value) }
        } else {
            let value = self.arena.alloc(Uniq(value));
            assert!(self.set.insert(value));
            value
        }
    }
}

it probably makes sense to add a type alias to your project when using this, e.g.

type Ty<'a> = &'a Uniq<Type<'a>>;

edits

1) considering that most compilers don’t really drop their allocator, you can also use Box::leak(Box::new(value)) instead of an arena allocator. this means that you can replace the tbh fairly irrelevant lifetime with 'static everywhere (thank you eddyb for the the suggestion)

2) by using Uniq you restrict pattern matching again, so defining it as struct Uniq<T>(pub T, pub private::PrivateZst) is probably a good idea. users can now write the pattern Type::Function { arguments: _, ret: &Uniq(Type::Unit, _) } while still not being able to construct any objects of type Uniq themselves

3) it is often annoying to need a mutable reference to the interner, for example interner.intern(Foo(interner.intern(Bar))) does not work for mutable interners. using some sort of interior mutability for the HashSet is often a good idea

if you find any typos or errors in this post, please pm me on zulip, discord or cohost

back

impressum rss