[we]blog

Compiling Grammars to Bytecode

The previous post left the matcher paying for a tagged-union dispatch and an indirect call on every node. This post takes the conventional way out: compile the grammar to a flat array of bytecode and loop a small interpreter over the instructions. Same semantics, different machine. The bytecode shape is LPeg’s¹.

The Vocabulary

Twelve opcodes. Eleven split into three groups: atomic matchers, control flow with backtracking, and rule-call mechanics. The twelfth, match, is the program terminator.

pub const Opcode = enum(u8) {
    /// Match a single literal byte or fail.
    char,
    /// Match any single byte or fail.
    any,
    /// Match a byte within a 256-bit charset or fail.
    set,
    /// Match a byte NOT within a 256-bit charset or fail.
    neg_set,
    /// Push backtrack entry (current pos, L) onto the stack.
    /// On failure, restore pos and jump to L.
    choice,
    /// Pop backtrack entry, jump to L. Used after a choice succeeds.
    commit,
    /// Trigger backtracking: pop entry, restore pos, jump to saved L.
    fail,
    /// Pop backtrack entry then fail again. Implements `!e`.
    fail_twice,
    /// Unconditional jump to L.
    jump,
    /// Push return address, jump to L. Implements rule calls.
    call,
    /// Pop return address and jump to it.
    ret,
    /// Accept the match and halt.
    match,
};
Instruction.zig:6-32

The atomic matchers (char, any, set, neg_set) consume zero or one input bytes and either advance or fail. Failure isn’t fatal; it triggers backtracking.

The control-flow opcodes (choice, commit, fail, fail_twice, jump) operate on a backtracking stack. choice L pushes a backtrack frame: current position, jump target L. commit L drops the most recent backtrack frame and jumps to L. fail triggers an immediate backtrack. fail_twice pops one extra entry first, which is what makes PEG’s !e lookahead cheap.

call and ret do what they say. They share the same stack as choice points; the dispatch tells one from the other by tag.

Compiling Alternation

The most-quoted compilation pattern in any LPeg-derived implementation:

a / b   becomes
  choice L1
  <a>
  commit L2
L1:
  <b>
L2:

choice L1 says “if the alternative below me fails, jump to L1 with my saved position.” If a succeeds, commit L2 drops the backtrack point and skips past b. If a fails partway through, the VM falls through to the backtrack helper, restores position, and ends up at L1. The same mechanism chains for any number of alternatives:

fn compileAlternation(self: *Compiler, alts: []const Ast.Node) void {
    if (alts.len == 0) return;
    if (alts.len == 1) {
        self.compileNode(alts[0]);
        return;
    }

    var commits: [256]u32 = undefined;
    var commit_count: usize = 0;

    for (alts[0 .. alts.len - 1], 0..) |alt, i| {
        _ = i;
        const choice_addr = self.emitPlaceholder();
        self.compileNode(alt);
        commits[commit_count] = self.emitPlaceholder();
        commit_count += 1;
        // Patch choice to point to next alternative.
        self.code[choice_addr] = .{ .op = .choice, .data = .{ .offset = self.code_len } };
    }

    // Last alternative: no choice needed.
    self.compileNode(alts[alts.len - 1]);

    // Patch all commits to jump past the last alternative.
    const end = self.code_len;
    for (commits[0..commit_count]) |addr| {
        self.code[addr] = .{ .op = .commit, .data = .{ .offset = end } };
    }
}
Compiler.zig:143-171

The output is dense. a|b compiles to five instructions:

   0: choice  -> 3
   1: char    'a'
   2: commit  -> 4
   3: char    'b'
   4: match

That’s the entire VM image for a two-character alternation. Five opcodes, no AST nodes left.

Compiling Repetition

*e is alternation in disguise: try e, on success loop, on failure exit. The pattern:

L:
  choice Lend
  <e>
  commit L
Lend:

In code:

fn compileRepetition(self: *Compiler, rep: Ast.Repetition) void {
    // Emit min required copies.
    for (0..rep.min) |_| {
        self.compileNode(rep.element.*);
    }

    if (rep.max) |max| {
        // Bounded: emit (max - min) optional copies.
        // Each: choice Lskip; <e>; commit Lnext; Lskip:
        const optional = max - rep.min;
        for (0..optional) |_| {
            const choice_addr = self.emitPlaceholder();
            self.compileNode(rep.element.*);
            const commit_addr = self.emitPlaceholder();
            self.code[choice_addr] = .{ .op = .choice, .data = .{ .offset = self.code_len } };
            self.code[commit_addr] = .{ .op = .commit, .data = .{ .offset = self.code_len } };
        }
    } else {
        // Unbounded: e* = L: choice Lend; <e>; commit L; Lend:
        const loop_start = self.code_len;
        const choice_addr = self.emitPlaceholder();
        self.compileNode(rep.element.*);
        self.emit(.{ .op = .commit, .data = .{ .offset = loop_start } });
        self.code[choice_addr] = .{ .op = .choice, .data = .{ .offset = self.code_len } };
    }
}
Compiler.zig:174-199

a* compiles to:

   0: choice  -> 3
   1: char    'a'
   2: commit  -> 0
   3: match

Four instructions, one of which (match) is the program terminator. The loop is two: the choice/commit pair around the body. Each successful iteration commits back to position 0, rebuilding a fresh choice point; a failing body falls through to Lend with the saved position from that iteration’s start, not from before the loop.

Bounded repetition e{n,m} doesn’t get its own opcode. The compiler unrolls the lower bound and emits an optional-block per slot above it. a{2,3} compiles to:

   0: char    'a'
   1: char    'a'
   2: choice  -> 5
   3: char    'a'
   4: commit  -> 5
   5: match

Two mandatory copies, then one optional block: choice/body/commit. Wasteful for big bounds but conceptually clean: the VM never has to know what a repetition is.

[a-z]+ shows the unrolling. The + is {1,}, so the compiler emits one mandatory copy and then the unbounded loop:

   0: set     [a-z]
   1: choice  -> 4
   2: set     [a-z]
   3: commit  -> 1
   4: match

Two set instructions, same charset table entry both times.

The Not Predicate

PEG’s !e succeeds when e fails and consumes nothing either way. The naive compilation needs a way to invert “did this thing succeed.” The LPeg trick reuses the backtracking machinery:

        .not_predicate => |inner| {
            // !e: choice L; <e>; fail_twice; L:
            const choice_addr = self.emitPlaceholder();
            self.compileNode(inner.*);
            self.emit(.{ .op = .fail_twice });
            self.code[choice_addr] = .{ .op = .choice, .data = .{ .offset = self.code_len } };
        },
Compiler.zig:128-134

choice L
<e>
fail_twice
L:

Walk through both cases. If e succeeds: control reaches fail_twice, which pops the predicate’s own choice frame and then fails. With that frame gone, the failure escapes the predicate and propagates to whatever encloses it. If e fails: the choice frame catches it, position is restored, control jumps to L. Net effect: !e succeeds when e fails, fails when e succeeds, and consumes nothing either way.

fail_twice exists for exactly this pattern. Without it the predicate needs choice L; <e>; commit Lc; Lc: fail; L:, one extra instruction and one extra commit-then-fail dance per predicate. Predicates are common (lookahead is one of PEG’s selling points), so shaving a single opcode off them is a real win.

The Dispatch Loop

The VM is a flat while over a switch. Twelve opcodes, twelve arms. Here are the four backtracking-related ones:

            .choice => {
                stack[sp] = .{ .choice = .{ .pos = pos, .pc = inst.data.offset } };
                sp += 1;
                pc += 1;
            },
            .commit => {
                // Pop the backtrack entry (discard it) and jump.
                sp -= 1;
                pc = inst.data.offset;
            },
            .fail => {
                if (self.backtrack(&stack, &sp, &pc, &pos)) continue;
                return null;
            },
            .fail_twice => {
                // Pop one entry then fail.
                sp -= 1;
                if (self.backtrack(&stack, &sp, &pc, &pos)) continue;
                return null;
            },
Vm.zig:86-105

choice pushes; commit pops and jumps; fail and fail_twice route through the same backtrack helper, which is the only place that actually rewinds:

fn backtrack(self: *Vm, stack: *[max_stack]Entry, sp: *usize, pc: *u32, pos: *usize) bool {
    while (sp.* > 0) {
        sp.* -= 1;
        switch (stack[sp.*]) {
            .choice => |c| {
                if (self.trace) |t| {
                    t.writer.print("      backtrack -> pc={d} pos={d}\n", .{ c.pc, c.pos }) catch {};
                }
                pc.* = c.pc;
                pos.* = c.pos;
                return true;
            },
            .ret => {},
        }
    }
    return false;
}
Vm.zig:126-142

The stack stores either a choice frame (saved pos and pc) or a return address (just pc). On a fail we walk down popping ret entries and stop at the first choice we find; everything between was a chain of calls we’re not coming back from.

That’s why backtracking is cheap here. There’s no work to undo. We restore a position pointer and a program counter, and the input is read-only. No tree state to roll back, no scratch buffers to truncate, no half-built captures (yet) to discard.

Tooling

A flat array of Inst is opaque without a way to look at it. The disassembler turns the array into something readable, which is where all the 0: choice -> 3 listings above came from. The vm CLI subcommand dumps the bytecode for a grammar, optionally runs it, and with -t prints pc, sp, pos, the remaining input preview, and the decoded instruction at every step. Plus a backtrack -> pc=N pos=M line whenever the helper rewinds.

For grammars where the runtime behavior is non-obvious (most of them, until you’ve stared at the bytecode for a while), trace output is the difference between hours and minutes of debugging. The disassembler lives in its own file because pretty-printing has no business in the hot path; pulling it out keeps the VM’s execute loop a flat switch with no if (debug) ... arms.

What the Deal Is

We trade the single-pass simplicity of the tree walker for a separate compile phase, an Inst array, a backtrack stack, and a moderately bigger codebase. In return:

The dispatch is denser. A u8 opcode plus a switch over twelve arms beats the Ast.Node tagged-union dispatch from the tree walker, both in cache footprint and in branch-predictor cooperation.

The bytecode is also a value the optimizer can stare at. Peephole passes (common-prefix factoring, charset fusion, optional-char) become local rewrites on a flat array, not tree surgery.

Once the bytecode shape is fixed, the interpreter stops being the only execution strategy. Anything that consumes []Inst and produces the same answer is a valid backend. AOT to Zig source, JIT to native code, WASM out the side: same compiler, different consumer.

The tree walker was deliberately the simplest thing that could work. This is the second-simplest. The third move is to stop interpreting altogether.

Walking the AST