[we]blog

Walking the AST

Matching an ABNF grammar against input is two questions: how does alternation choose between branches, and how does concatenation react when an already-committed prefix turns out to be wrong? Different parsing strategies answer differently. The runtime matcher in zpars takes the simplest combination on both axes (first match wins for alternation, no rollback for concatenation) and accepts the consequences.

Those two choices push the matcher onto the PEG side of the parsing landscape, even though ABNF as defined in RFC 5234 is a context-free grammar. ABNF documents on the wire don’t usually exercise the difference, but the difference exists and is worth being explicit about.

The Shape

The matcher is a tree-walking interpreter over Ast.Node. Input is a slice; output is the matched prefix and what’s left:

pub const Result = struct {
    /// The matched input span.
    value: []const u8,
    /// Unconsumed input after the match.
    rest: []const u8,
};
Matcher.zig:21-26

The public entry point dispatches by rule name:

/// Match `input` against the rule named `rule_name`.
/// Returns null if the rule is not found or the input does not match.
pub fn match(self: *const Matcher, rule_name: []const u8, input: []const u8) ?Result {
    return self.matchRulename(rule_name, input, 0);
}
Matcher.zig:37-41

Failure is null, not an error. A ?Result is a half-word: nothing to allocate, nothing to format, no exception machinery. Match failure is a value, not an exceptional condition; the caller may try a different rule, walk a different branch of an alternation, or stop. Most calls to matchNode fail; treating that as exceptional would be a category error.

The dispatch itself is the tree-walk:

fn matchNode(self: *const Matcher, node: Ast.Node, input: []const u8, depth: usize) ?Result {
    if (depth > max_depth) return null;

    return switch (node) {
        .char_val => |cv| matchCharVal(cv, input),
        .num_val => |nv| matchNumVal(nv, input),
        .prose_val => null,
        .rulename => |name| self.matchRulename(name, input, depth),
        .alternation => |alts| self.matchAlternation(alts, input, depth),
        .concatenation => |elems| self.matchConcatenation(elems, input, depth),
        .repetition => |rep| self.matchRepetition(rep, input, depth),
    };
}
Matcher.zig:45-57

One switch per node, recursing into the variant’s handler. The grammar isn’t compiled. It’s a value the CPU navigates at every step, and that has a cost: a tagged-union check and an indirect call per node, every time. That cost is the reason a comptime version exists too. More on that at the end.

Trade 1: Ordered Choice

ABNF’s alternation operator / is a set of alternatives. RFC 5234 doesn’t prescribe how to choose between them when more than one matches. A CFG-faithful parser would either return all valid parses (ambiguity) or pick the longest, doing extra work either way.

This matcher does neither:

fn matchAlternation(self: *const Matcher, alts: []const Ast.Node, input: []const u8, depth: usize) ?Result {
    for (alts) |alt| {
        if (self.matchNode(alt, input, depth)) |r| return r;
    }
    return null;
}
Matcher.zig:90-95

return r on the first hit. We never look at later alternatives. This is ordered choice in the sense of Bryan Ford’s PEG paper¹: the operator stops being symmetric and starts caring about left-to-right order.

A grammar like

rule = a / b
a    = "x"
b    = "xy"

matched against xy consumes one character and stops, because a matches first. A CFG-style parser would see two valid parses (a, leaving y; or b, leaving nothing) and either flag the ambiguity or prefer the longest. This matcher commits to a and walks away.

In practice this rarely matters. ABNF authors write their alternations with an implicit ordering anyway: specific cases first, general cases last, because that’s how grammar review tends to flow. The cases where ordered choice diverges from the longest-match interpretation are typically grammar bugs the author would want to fix.

What ordered choice buys: every alternation evaluation is bounded by the position of the first success. We never re-enter a branch we’ve left. That matters more for the next trade.

Trade 2: Concatenation Commits

Concatenation in ABNF is just juxtaposition: a b c matches a followed by b followed by c. The interesting question is what happens if b fails after a succeeded. Does the matcher try a different match for a?

It doesn’t:

fn matchConcatenation(self: *const Matcher, elems: []const Ast.Node, input: []const u8, depth: usize) ?Result {
    var rest = input;
    for (elems) |elem| {
        const r = self.matchNode(elem, rest, depth) orelse return null;
        rest = r.rest;
    }
    return .{ .value = input[0 .. input.len - rest.len], .rest = rest };
}
Matcher.zig:97-104

Each iteration commits its consumed prefix via rest = r.rest. There’s no stack of choice points, no saved positions to roll back to. The first time an element fails, the whole concatenation fails.

Combined with greedy repetition, this produces a class of grammars where a valid parse exists but the matcher can’t find it. The clean illustration:

rule = *DIGIT "1"

matched against 111. The *DIGIT is greedy: with no upper bound, it keeps going while DIGIT matches. That consumes all three characters, and then "1" has nothing left to match against. Fail.

A backtracking matcher would notice and try shorter matches for *DIGIT: 11, then 1, then "", retrying "1" against the remainder each time. One of those attempts succeeds. This matcher does no such thing.

This is the deliberate trade. Backtracking parsers can blow up exponentially on certain grammars (the famous catastrophic-backtracking class²) because each choice point multiplies the work in the worst case.

The non-backtracking version is worst-case bounded by the input length times the grammar depth, with no exponential trap. The price is that some CFG-valid grammars fail.

The mitigation isn’t algorithmic; it’s the convention that ABNF authors write grammars that work under ordered, greedy, non-backtracking semantics. *DIGIT "1" would be replaced by 1*DIGIT if the trailing constraint were intended, or by *DIGIT *("1") if the original semantics were really wanted. Most actual ABNF in published RFCs is already in this shape because it’s been implemented this way for decades.

A Footgun in Repetition

While we’re here, the repetition handler has one more piece of trade-craft:

fn matchRepetition(self: *const Matcher, rep: Ast.Repetition, input: []const u8, depth: usize) ?Result {
    var rest = input;
    var count: usize = 0;

    while (rep.max == null or count < rep.max.?) {
        const r = self.matchNode(rep.element.*, rest, depth) orelse break;
        // Guard against zero-length matches causing infinite loops.
        if (r.rest.len == rest.len) break;
        rest = r.rest;
        count += 1;
    }

    if (count < rep.min) return null;
    return .{ .value = input[0 .. input.len - rest.len], .rest = rest };
}
Matcher.zig:106-120

Look at line 113: if (r.rest.len == rest.len) break;. That’s the zero-length-match guard. Without it, a grammar like

rule = *("")

would loop forever: the body matches successfully, consumes nothing, and the loop runs again with the same input. The guard breaks out the moment a match doesn’t advance the cursor. Two lines of code, one entire class of grammar bugs prevented.

The guard is necessary because the grammar comes from outside the program. A handwritten matcher could promise its own bodies always advance; a matcher that runs user-supplied grammars cannot. Defensive code earns its keep when the input is data.

What the Matcher Refuses

A few small refusals worth naming.

prose_val returns null, always. RFC 5234 §4 defines <prose> as informational text: the grammar author punted, asking a human reader to fill in semantics the notation can’t capture. The matcher honors that: a prose value never matches, and the rule containing it consequently fails. Implementations that try to be helpful by partial-matching or always-true prose are introducing semantics the spec explicitly excluded. Better to fail loudly than to invent.

Recursion caps at 256 frames. A grammar deep enough to bottom out the stack is almost certainly broken: left recursion the validator missed, or some pathological right-recursive structure. The validator already flags left recursion, so this is belt-and-suspenders for the rare case where it slips through (or where the grammar shape is fine but the input exercises absurd nesting). 256 is generous; ABNF rules in published RFCs nest two or three levels deep at most.

Rule names are looked up case-insensitively. RFC 5234 §2.1: “Rule names are case insensitive.” The matcher inherits the case-insensitive index the validator already builds, so Foo and foo and FOO all resolve to the same rule. This sounds trivial but it’s exactly the kind of conformance detail that quietly breaks downstream consumers when an implementation gets it wrong.

What This Doesn’t Do

The matcher is small (under 300 lines including core-rule definitions) and predictable. It also leaves performance on the floor. Every node walk goes through the matchNode switch: a tagged-union discriminant, an indirect call into the variant handler, a stack frame. The grammar is data the CPU has to navigate; nothing is specialized for any particular grammar.

The alternative, visible in zpars but separately, is to compile the grammar at compile time. Each rule becomes a Zig type, each combinator becomes a generated function, the indirect-dispatch overhead disappears. Same input/output contract, same semantics, completely different execution model. The post after this one is about that.

There’s also the question of memoization. PEG with packrat memoization (caching (rule, position) results) is provably linear-time at the cost of O(input × rules) memory. The non-memoizing version here can do redundant work when a sub-grammar is re-tried at the same position, but for ABNF-shaped grammars (small rule count, modest input size, no deep recursive recombination) the constant factor of the cache lookup tends to swamp the savings. There’s a future post in there too, but only when the project has a workload that motivates measuring it.

All the Errors at Once
Compiling Grammars to Bytecode