[we]blog

All the Errors at Once

The parser from last time had a problem: on the first malformed token it gave up. One bad rule meant nothing else in the file got parsed. And even on a clean parse, plenty of broken grammars sneak through. A rule referencing something undefined. A rule defined but never used. A pair of rules that recurse into each other forever with no terminal to anchor them.

This post fixes both halves of that story:

  1. Recovery in the parser: catch a syntax error, skip to the next rule, keep going. Accumulate all syntax diagnostics instead of bailing on the first.
  2. A validator that runs on the AST after parsing succeeds, catching the semantic problems the parser fundamentally can’t see.

About 470 lines across Parser.zig and a new Validator.zig.

Recovery

The previous parser stored ?Diagnostic (a single optional) and threw error.SyntaxError on the first failure. The fix has two pieces: turn that field into a list, then stop bailing.

const std = @import("std");
const Token = @import("Token.zig");
const Ast = @import("Ast.zig");
const Diagnostic = @import("Diagnostic.zig");

const Parser = @This();

pub const ParseError = error{ SyntaxError, OutOfMemory, Overflow, InvalidCharacter };

tokens: []const Token,
source: []const u8,
allocator: std.mem.Allocator,
pos: usize,
diagnostics: std.ArrayList(Diagnostic) = .empty,
Parser.zig:1-14

diagnostics is now an ArrayList. The fail helper appends to it instead of overwriting:

fn fail(self: *Parser, expected: Diagnostic.Expected, token: Token) error{OutOfMemory}!void {
    try self.diagnostics.append(self.allocator, .{
        .expected = expected,
        .found_tag = token.tag,
        .found_start = token.start,
        .found_len = token.len,
        .line = token.line,
    });
}
Parser.zig:246-254

And parse() catches the syntax error, calls synchronize(), and loops back to try the next rule:

/// Parse all rules from the token stream.
pub fn parse(self: *Parser) ![]const Ast.Rule {
    var rules: std.ArrayList(Ast.Rule) = .empty;

    self.skipTrivia();
    while (self.peek().tag != .eof) {
        const rule = self.parseRule() catch |err| switch (err) {
            error.SyntaxError => {
                self.synchronize();
                self.skipTrivia();
                continue;
            },
            else => return err,
        };

        try rules.append(self.allocator, rule);
        self.skipTrivia();
    }

    return try rules.toOwnedSlice(self.allocator);
}
Parser.zig:25-45

synchronize is the recovery primitive. It skips tokens until it finds something that looks like the start of a new rule: a rulename followed by = or =/.

/// Skip tokens until the start of the next rule or EOF.
/// A rule boundary is a `rulename` followed by `=` or `=/`.
fn synchronize(self: *Parser) void {
    while (self.peek().tag != .eof) {
        if (self.peek().tag == .rulename) {
            const next = self.peekNextMeaningful();
            if (next == .equals or next == .equals_slash) return;
        }
        self.pos += 1;
    }
}
Parser.zig:287-297

That’s the entire recovery story. The parser doesn’t try to repair malformed input. It just abandons whatever it was parsing and restarts at the next rule boundary. Cheap, robust, easy to reason about.

The lookahead used here (peekNextMeaningful) is the same one isAtRepetition used to know when concatenation has run out of elements. The rule-boundary heuristic is consistent across the parser.

Validation

Recovery makes the parser robust to bad syntax. But syntactically valid grammars can still be broken in ways the parser fundamentally can’t see:

foo = bar          ; bar is never defined
unused = "x"       ; defined but never referenced
a = b              ; a → b → a, no terminal — never matches anything
b = a

A separate Validator pass runs on the AST after parsing, with the same accumulator pattern as the parser:

const std = @import("std");
const Ast = @import("Ast.zig");

const Validator = @This();

rules: []const Ast.Rule,
allocator: std.mem.Allocator,
diagnostics: std.ArrayList(Validation) = .empty,
Validator.zig:1-8

pub const Validation = struct {
    kind: Kind,
    /// The rule in which the issue was found.
    rule_name: []const u8,
    /// The undefined reference name (only set for `.undefined_rule`).
    ref_name: ?[]const u8 = null,

    pub const Kind = enum {
        duplicate_rule,
        undefined_rule,
        unused_rule,
        unproductive_rule,
    };
};
Validator.zig:10-23

Four kinds of issues, all collected into a single list. Same model as the parser’s diagnostics: the caller decides what to do with them (treat as errors, treat as warnings, format and exit, etc.).

validate() runs the rule list through a series of stages. I’ll skip the merging-and-duplicate-detection pass¹ and pull out the more interesting ones.

Undefined References

Walk the AST, collect every rulename reference into a set:

fn collectRefs(self: *Validator, node: Ast.Node, refs: *std.StringHashMap(void)) !void {
    switch (node) {
        .rulename => |name| try refs.put(name, {}),
        .alternation => |items| for (items) |item| try self.collectRefs(item, refs),
        .concatenation => |items| for (items) |item| try self.collectRefs(item, refs),
        .repetition => |rep| try self.collectRefs(rep.element.*, refs),
        .char_val, .num_val, .prose_val => {},
    }
}
Validator.zig:148-156

Then check each reference against the defined-rule index, skipping the standard library. ABNF’s core rules like ALPHA, DIGIT, CRLF are implicitly available without needing a definition.

    // Stage 3: check for undefined references.
    var ref_iter = refs.keyIterator();
    while (ref_iter.next()) |ref_name| {
        if (name_index.contains(ref_name.*)) continue;
        if (isCoreRule(ref_name.*)) continue;
        // Find the first rule that references this undefined name.
        const owner = self.findReferencer(merged_rules, ref_name.*);
        try self.diagnostics.append(self.allocator, .{
            .kind = .undefined_rule,
            .rule_name = owner orelse ref_name.*,
            .ref_name = ref_name.*,
        });
    }
Validator.zig:78-90

Unused Rules

The mirror image: a rule that’s defined but never appears in the reference set. With one wrinkle: the first rule is the start symbol, so it’s never referenced and not unused.

    // Stage 4: unused rules (skip index 0 — start symbol).
    for (merged_rules[1..]) |rule| {
        if (!refs.contains(rule.name)) {
            try self.diagnostics.append(self.allocator, .{
                .kind = .unused_rule,
                .rule_name = rule.name,
            });
        }
    }
Validator.zig:92-100

Productivity (Cycle Detection)

The interesting one. A rule is productive if it can match some finite string. a = "x" is productive. a = b followed by b = a is not: the cycle never bottoms out at a terminal.

The classic fixed-point algorithm: assume nothing is productive, then iterate. On each pass, mark a rule productive if its body is. Repeat until nothing changes:

    // Stage 5: productivity / cycle detection.
    var productive = try self.allocator.alloc(bool, merged_rules.len);
    defer self.allocator.free(productive);
    @memset(productive, false);

    var changed = true;
    while (changed) {
        changed = false;
        for (merged_rules, 0..) |rule, i| {
            if (productive[i]) continue;
            if (self.isProductive(rule.node, merged_rules, &name_index, productive)) {
                productive[i] = true;
                changed = true;
            }
        }
    }

    for (merged_rules, 0..) |rule, i| {
        if (!productive[i]) {
            try self.diagnostics.append(self.allocator, .{
                .kind = .unproductive_rule,
                .rule_name = rule.name,
            });
        }
    }
Validator.zig:102-126

The body check is recursive but small:

fn isProductive(
    self: *Validator,
    node: Ast.Node,
    merged_rules: []const Ast.Rule,
    name_index: *const std.StringHashMap(usize),
    productive: []const bool,
) bool {
    _ = self;
    return switch (node) {
        .char_val, .num_val, .prose_val => true,
        .rulename => |name| {
            if (isCoreRule(name)) return true;
            if (name_index.get(name)) |idx| return productive[idx];
            // Undefined — treat as non-productive (already reported).
            return false;
        },
        .alternation => |items| for (items) |item| {
            if (isProductive(undefined, item, merged_rules, name_index, productive))
                return true;
        } else false,
        .concatenation => |items| {
            for (items) |item| {
                if (!isProductive(undefined, item, merged_rules, name_index, productive))
                    return false;
            }
            return true;
        },
        .repetition => |rep| {
            // *0 (min=0) is always productive (can match zero times).
            if (rep.min == 0) return true;
            return isProductive(undefined, rep.element.*, merged_rules, name_index, productive);
        },
    };
}
Validator.zig:187-220

A literal is productive by definition. An alternation is productive if any branch is. A concatenation is productive only if every element is. A repetition with min = 0 is always productive: it can match zero times, which is a finite (empty) match. A rulename is productive exactly when the rule it points to is.

After the fixed point, every rule still flagged unproductive is part of a non-terminating cycle.

Wiring It Up

The CLI runs both layers and treats the kinds differently:

    var has_errors = false;
    for (validator.diagnostics.items) |v| {
        switch (v.kind) {
            .duplicate_rule => stderr.print("{s}: warning: duplicate definition of '{s}'\n", .{ filename, v.rule_name }) catch {},
            .undefined_rule => {
                stderr.print("{s}: error: rule '{s}' references undefined rule '{s}'\n", .{ filename, v.rule_name, v.ref_name.? }) catch {};
                has_errors = true;
            },
            .unused_rule => stderr.print("{s}: warning: rule '{s}' is defined but never referenced\n", .{ filename, v.rule_name }) catch {},
            .unproductive_rule => {
                stderr.print("{s}: error: rule '{s}' is unproductive (circular with no terminal escape)\n", .{ filename, v.rule_name }) catch {};
                has_errors = true;
            },
        }
    }
    stderr.flush() catch {};
    if (has_errors) std.process.exit(1);
main.zig:46-62

Undefined references and unproductive rules are fatal: the grammar is broken. Duplicates and unused rules are warnings, since the grammar still works.

Given:

sentence = subject verb object
subject = noun
verb    = "runs"
adj     = "blue"

zpars sentence.abnf produces:

sentence.abnf: error: rule 'sentence' references undefined rule 'object'
sentence.abnf: error: rule 'subject' references undefined rule 'noun'
sentence.abnf: warning: rule 'adj' is defined but never referenced

Three issues, one run.

Takeaways

From Tokens to AST
Walking the AST