Compiling Grammars to Bytecode
The previous post left the matcher paying for a tagged-union dispatch and an indirect call on every node. This post takes the conventional way out: compile the grammar to a flat array of bytecode and loop a small interpreter over the instructions. Same semantics, different machine. The bytecode shape is LPeg’s¹.
The Vocabulary
Twelve opcodes. Eleven split into three groups: atomic matchers, control flow with backtracking, and rule-call mechanics. The twelfth, match, is the program terminator.
pub const Opcode = enum(u8) {
/// Match a single literal byte or fail.
char,
/// Match any single byte or fail.
any,
/// Match a byte within a 256-bit charset or fail.
set,
/// Match a byte NOT within a 256-bit charset or fail.
neg_set,
/// Push backtrack entry (current pos, L) onto the stack.
/// On failure, restore pos and jump to L.
choice,
/// Pop backtrack entry, jump to L. Used after a choice succeeds.
commit,
/// Trigger backtracking: pop entry, restore pos, jump to saved L.
fail,
/// Pop backtrack entry then fail again. Implements `!e`.
fail_twice,
/// Unconditional jump to L.
jump,
/// Push return address, jump to L. Implements rule calls.
call,
/// Pop return address and jump to it.
ret,
/// Accept the match and halt.
match,
};
The atomic matchers (char, any, set, neg_set) consume zero or one input bytes and either advance or fail. Failure isn’t fatal; it triggers backtracking.
The control-flow opcodes (choice, commit, fail, fail_twice, jump) operate on a backtracking stack. choice L pushes a backtrack frame: current position, jump target L. commit L drops the most recent backtrack frame and jumps to L. fail triggers an immediate backtrack. fail_twice pops one extra entry first, which is what makes PEG’s !e lookahead cheap.
call and ret do what they say. They share the same stack as choice points; the dispatch tells one from the other by tag.
Compiling Alternation
The most-quoted compilation pattern in any LPeg-derived implementation:
a / b becomes
choice L1
<a>
commit L2
L1:
<b>
L2:
choice L1 says “if the alternative below me fails, jump to L1 with my saved position.” If a succeeds, commit L2 drops the backtrack point and skips past b. If a fails partway through, the VM falls through to the backtrack helper, restores position, and ends up at L1. The same mechanism chains for any number of alternatives:
fn compileAlternation(self: *Compiler, alts: []const Ast.Node) void {
if (alts.len == 0) return;
if (alts.len == 1) {
self.compileNode(alts[0]);
return;
}
var commits: [256]u32 = undefined;
var commit_count: usize = 0;
for (alts[0 .. alts.len - 1], 0..) |alt, i| {
_ = i;
const choice_addr = self.emitPlaceholder();
self.compileNode(alt);
commits[commit_count] = self.emitPlaceholder();
commit_count += 1;
// Patch choice to point to next alternative.
self.code[choice_addr] = .{ .op = .choice, .data = .{ .offset = self.code_len } };
}
// Last alternative: no choice needed.
self.compileNode(alts[alts.len - 1]);
// Patch all commits to jump past the last alternative.
const end = self.code_len;
for (commits[0..commit_count]) |addr| {
self.code[addr] = .{ .op = .commit, .data = .{ .offset = end } };
}
}
The output is dense. a|b compiles to five instructions:
0: choice -> 3
1: char 'a'
2: commit -> 4
3: char 'b'
4: match
That’s the entire VM image for a two-character alternation. Five opcodes, no AST nodes left.
Compiling Repetition
*e is alternation in disguise: try e, on success loop, on failure exit. The pattern:
L:
choice Lend
<e>
commit L
Lend:
In code:
fn compileRepetition(self: *Compiler, rep: Ast.Repetition) void {
// Emit min required copies.
for (0..rep.min) |_| {
self.compileNode(rep.element.*);
}
if (rep.max) |max| {
// Bounded: emit (max - min) optional copies.
// Each: choice Lskip; <e>; commit Lnext; Lskip:
const optional = max - rep.min;
for (0..optional) |_| {
const choice_addr = self.emitPlaceholder();
self.compileNode(rep.element.*);
const commit_addr = self.emitPlaceholder();
self.code[choice_addr] = .{ .op = .choice, .data = .{ .offset = self.code_len } };
self.code[commit_addr] = .{ .op = .commit, .data = .{ .offset = self.code_len } };
}
} else {
// Unbounded: e* = L: choice Lend; <e>; commit L; Lend:
const loop_start = self.code_len;
const choice_addr = self.emitPlaceholder();
self.compileNode(rep.element.*);
self.emit(.{ .op = .commit, .data = .{ .offset = loop_start } });
self.code[choice_addr] = .{ .op = .choice, .data = .{ .offset = self.code_len } };
}
}
a* compiles to:
0: choice -> 3
1: char 'a'
2: commit -> 0
3: match
Four instructions, one of which (match) is the program terminator. The loop is two: the choice/commit pair around the body. Each successful iteration commits back to position 0, rebuilding a fresh choice point; a failing body falls through to Lend with the saved position from that iteration’s start, not from before the loop.
Bounded repetition e{n,m} doesn’t get its own opcode. The compiler unrolls the lower bound and emits an optional-block per slot above it. a{2,3} compiles to:
0: char 'a'
1: char 'a'
2: choice -> 5
3: char 'a'
4: commit -> 5
5: match
Two mandatory copies, then one optional block: choice/body/commit. Wasteful for big bounds but conceptually clean: the VM never has to know what a repetition is.
[a-z]+ shows the unrolling. The + is {1,}, so the compiler emits one mandatory copy and then the unbounded loop:
0: set [a-z]
1: choice -> 4
2: set [a-z]
3: commit -> 1
4: match
Two set instructions, same charset table entry both times.
The Not Predicate
PEG’s !e succeeds when e fails and consumes nothing either way. The naive compilation needs a way to invert “did this thing succeed.” The LPeg trick reuses the backtracking machinery:
.not_predicate => |inner| {
// !e: choice L; <e>; fail_twice; L:
const choice_addr = self.emitPlaceholder();
self.compileNode(inner.*);
self.emit(.{ .op = .fail_twice });
self.code[choice_addr] = .{ .op = .choice, .data = .{ .offset = self.code_len } };
},
choice L
<e>
fail_twice
L:
Walk through both cases. If e succeeds: control reaches fail_twice, which pops the predicate’s own choice frame and then fails. With that frame gone, the failure escapes the predicate and propagates to whatever encloses it. If e fails: the choice frame catches it, position is restored, control jumps to L. Net effect: !e succeeds when e fails, fails when e succeeds, and consumes nothing either way.
fail_twice exists for exactly this pattern. Without it the predicate needs choice L; <e>; commit Lc; Lc: fail; L:, one extra instruction and one extra commit-then-fail dance per predicate. Predicates are common (lookahead is one of PEG’s selling points), so shaving a single opcode off them is a real win.
The Dispatch Loop
The VM is a flat while over a switch. Twelve opcodes, twelve arms. Here are the four backtracking-related ones:
.choice => {
stack[sp] = .{ .choice = .{ .pos = pos, .pc = inst.data.offset } };
sp += 1;
pc += 1;
},
.commit => {
// Pop the backtrack entry (discard it) and jump.
sp -= 1;
pc = inst.data.offset;
},
.fail => {
if (self.backtrack(&stack, &sp, &pc, &pos)) continue;
return null;
},
.fail_twice => {
// Pop one entry then fail.
sp -= 1;
if (self.backtrack(&stack, &sp, &pc, &pos)) continue;
return null;
},
choice pushes; commit pops and jumps; fail and fail_twice route through the same backtrack helper, which is the only place that actually rewinds:
fn backtrack(self: *Vm, stack: *[max_stack]Entry, sp: *usize, pc: *u32, pos: *usize) bool {
while (sp.* > 0) {
sp.* -= 1;
switch (stack[sp.*]) {
.choice => |c| {
if (self.trace) |t| {
t.writer.print(" backtrack -> pc={d} pos={d}\n", .{ c.pc, c.pos }) catch {};
}
pc.* = c.pc;
pos.* = c.pos;
return true;
},
.ret => {},
}
}
return false;
}
The stack stores either a choice frame (saved pos and pc) or a return address (just pc). On a fail we walk down popping ret entries and stop at the first choice we find; everything between was a chain of calls we’re not coming back from.
That’s why backtracking is cheap here. There’s no work to undo. We restore a position pointer and a program counter, and the input is read-only. No tree state to roll back, no scratch buffers to truncate, no half-built captures (yet) to discard.
Tooling
A flat array of Inst is opaque without a way to look at it. The disassembler turns the array into something readable, which is where all the 0: choice -> 3 listings above came from. The vm CLI subcommand dumps the bytecode for a grammar, optionally runs it, and with -t prints pc, sp, pos, the remaining input preview, and the decoded instruction at every step. Plus a backtrack -> pc=N pos=M line whenever the helper rewinds.
For grammars where the runtime behavior is non-obvious (most of them, until you’ve stared at the bytecode for a while), trace output is the difference between hours and minutes of debugging. The disassembler lives in its own file because pretty-printing has no business in the hot path; pulling it out keeps the VM’s execute loop a flat switch with no if (debug) ... arms.
What the Deal Is
We trade the single-pass simplicity of the tree walker for a separate compile phase, an Inst array, a backtrack stack, and a moderately bigger codebase. In return:
The dispatch is denser. A u8 opcode plus a switch over twelve arms beats the Ast.Node tagged-union dispatch from the tree walker, both in cache footprint and in branch-predictor cooperation.
The bytecode is also a value the optimizer can stare at. Peephole passes (common-prefix factoring, charset fusion, optional-char) become local rewrites on a flat array, not tree surgery.
Once the bytecode shape is fixed, the interpreter stops being the only execution strategy. Anything that consumes []Inst and produces the same answer is a valid backend. AOT to Zig source, JIT to native code, WASM out the side: same compiler, different consumer.
The tree walker was deliberately the simplest thing that could work. This is the second-simplest. The third move is to stop interpreting altogether.