All the Errors at Once
The parser from last time had a problem: on the first malformed token it gave up. One bad rule meant nothing else in the file got parsed. And even on a clean parse, plenty of broken grammars sneak through. A rule referencing something undefined. A rule defined but never used. A pair of rules that recurse into each other forever with no terminal to anchor them.
This post fixes both halves of that story:
- Recovery in the parser: catch a syntax error, skip to the next rule, keep going. Accumulate all syntax diagnostics instead of bailing on the first.
- A validator that runs on the AST after parsing succeeds, catching the semantic problems the parser fundamentally can’t see.
About 470 lines across Parser.zig and a new Validator.zig.
Recovery
The previous parser stored ?Diagnostic (a single optional) and threw error.SyntaxError on the first failure. The fix has two pieces: turn that field into a list, then stop bailing.
const std = @import("std");
const Token = @import("Token.zig");
const Ast = @import("Ast.zig");
const Diagnostic = @import("Diagnostic.zig");
const Parser = @This();
pub const ParseError = error{ SyntaxError, OutOfMemory, Overflow, InvalidCharacter };
tokens: []const Token,
source: []const u8,
allocator: std.mem.Allocator,
pos: usize,
diagnostics: std.ArrayList(Diagnostic) = .empty,
diagnostics is now an ArrayList. The fail helper appends to it instead of overwriting:
fn fail(self: *Parser, expected: Diagnostic.Expected, token: Token) error{OutOfMemory}!void {
try self.diagnostics.append(self.allocator, .{
.expected = expected,
.found_tag = token.tag,
.found_start = token.start,
.found_len = token.len,
.line = token.line,
});
}
And parse() catches the syntax error, calls synchronize(), and loops back to try the next rule:
/// Parse all rules from the token stream.
pub fn parse(self: *Parser) ![]const Ast.Rule {
var rules: std.ArrayList(Ast.Rule) = .empty;
self.skipTrivia();
while (self.peek().tag != .eof) {
const rule = self.parseRule() catch |err| switch (err) {
error.SyntaxError => {
self.synchronize();
self.skipTrivia();
continue;
},
else => return err,
};
try rules.append(self.allocator, rule);
self.skipTrivia();
}
return try rules.toOwnedSlice(self.allocator);
}
synchronize is the recovery primitive. It skips tokens until it finds something that looks like the start of a new rule: a rulename followed by = or =/.
/// Skip tokens until the start of the next rule or EOF.
/// A rule boundary is a `rulename` followed by `=` or `=/`.
fn synchronize(self: *Parser) void {
while (self.peek().tag != .eof) {
if (self.peek().tag == .rulename) {
const next = self.peekNextMeaningful();
if (next == .equals or next == .equals_slash) return;
}
self.pos += 1;
}
}
That’s the entire recovery story. The parser doesn’t try to repair malformed input. It just abandons whatever it was parsing and restarts at the next rule boundary. Cheap, robust, easy to reason about.
The lookahead used here (peekNextMeaningful) is the same one isAtRepetition used to know when concatenation has run out of elements. The rule-boundary heuristic is consistent across the parser.
Validation
Recovery makes the parser robust to bad syntax. But syntactically valid grammars can still be broken in ways the parser fundamentally can’t see:
foo = bar ; bar is never defined
unused = "x" ; defined but never referenced
a = b ; a → b → a, no terminal — never matches anything
b = a
A separate Validator pass runs on the AST after parsing, with the same accumulator pattern as the parser:
const std = @import("std");
const Ast = @import("Ast.zig");
const Validator = @This();
rules: []const Ast.Rule,
allocator: std.mem.Allocator,
diagnostics: std.ArrayList(Validation) = .empty,
pub const Validation = struct {
kind: Kind,
/// The rule in which the issue was found.
rule_name: []const u8,
/// The undefined reference name (only set for `.undefined_rule`).
ref_name: ?[]const u8 = null,
pub const Kind = enum {
duplicate_rule,
undefined_rule,
unused_rule,
unproductive_rule,
};
};
Four kinds of issues, all collected into a single list. Same model as the parser’s diagnostics: the caller decides what to do with them (treat as errors, treat as warnings, format and exit, etc.).
validate() runs the rule list through a series of stages. I’ll skip the merging-and-duplicate-detection pass¹ and pull out the more interesting ones.
Undefined References
Walk the AST, collect every rulename reference into a set:
fn collectRefs(self: *Validator, node: Ast.Node, refs: *std.StringHashMap(void)) !void {
switch (node) {
.rulename => |name| try refs.put(name, {}),
.alternation => |items| for (items) |item| try self.collectRefs(item, refs),
.concatenation => |items| for (items) |item| try self.collectRefs(item, refs),
.repetition => |rep| try self.collectRefs(rep.element.*, refs),
.char_val, .num_val, .prose_val => {},
}
}
Then check each reference against the defined-rule index, skipping the standard library. ABNF’s core rules like ALPHA, DIGIT, CRLF are implicitly available without needing a definition.
// Stage 3: check for undefined references.
var ref_iter = refs.keyIterator();
while (ref_iter.next()) |ref_name| {
if (name_index.contains(ref_name.*)) continue;
if (isCoreRule(ref_name.*)) continue;
// Find the first rule that references this undefined name.
const owner = self.findReferencer(merged_rules, ref_name.*);
try self.diagnostics.append(self.allocator, .{
.kind = .undefined_rule,
.rule_name = owner orelse ref_name.*,
.ref_name = ref_name.*,
});
}
Unused Rules
The mirror image: a rule that’s defined but never appears in the reference set. With one wrinkle: the first rule is the start symbol, so it’s never referenced and not unused.
// Stage 4: unused rules (skip index 0 — start symbol).
for (merged_rules[1..]) |rule| {
if (!refs.contains(rule.name)) {
try self.diagnostics.append(self.allocator, .{
.kind = .unused_rule,
.rule_name = rule.name,
});
}
}
Productivity (Cycle Detection)
The interesting one. A rule is productive if it can match some finite string. a = "x" is productive. a = b followed by b = a is not: the cycle never bottoms out at a terminal.
The classic fixed-point algorithm: assume nothing is productive, then iterate. On each pass, mark a rule productive if its body is. Repeat until nothing changes:
// Stage 5: productivity / cycle detection.
var productive = try self.allocator.alloc(bool, merged_rules.len);
defer self.allocator.free(productive);
@memset(productive, false);
var changed = true;
while (changed) {
changed = false;
for (merged_rules, 0..) |rule, i| {
if (productive[i]) continue;
if (self.isProductive(rule.node, merged_rules, &name_index, productive)) {
productive[i] = true;
changed = true;
}
}
}
for (merged_rules, 0..) |rule, i| {
if (!productive[i]) {
try self.diagnostics.append(self.allocator, .{
.kind = .unproductive_rule,
.rule_name = rule.name,
});
}
}
The body check is recursive but small:
fn isProductive(
self: *Validator,
node: Ast.Node,
merged_rules: []const Ast.Rule,
name_index: *const std.StringHashMap(usize),
productive: []const bool,
) bool {
_ = self;
return switch (node) {
.char_val, .num_val, .prose_val => true,
.rulename => |name| {
if (isCoreRule(name)) return true;
if (name_index.get(name)) |idx| return productive[idx];
// Undefined — treat as non-productive (already reported).
return false;
},
.alternation => |items| for (items) |item| {
if (isProductive(undefined, item, merged_rules, name_index, productive))
return true;
} else false,
.concatenation => |items| {
for (items) |item| {
if (!isProductive(undefined, item, merged_rules, name_index, productive))
return false;
}
return true;
},
.repetition => |rep| {
// *0 (min=0) is always productive (can match zero times).
if (rep.min == 0) return true;
return isProductive(undefined, rep.element.*, merged_rules, name_index, productive);
},
};
}
A literal is productive by definition. An alternation is productive if any branch is. A concatenation is productive only if every element is. A repetition with min = 0 is always productive: it can match zero times, which is a finite (empty) match. A rulename is productive exactly when the rule it points to is.
After the fixed point, every rule still flagged unproductive is part of a non-terminating cycle.
Wiring It Up
The CLI runs both layers and treats the kinds differently:
var has_errors = false;
for (validator.diagnostics.items) |v| {
switch (v.kind) {
.duplicate_rule => stderr.print("{s}: warning: duplicate definition of '{s}'\n", .{ filename, v.rule_name }) catch {},
.undefined_rule => {
stderr.print("{s}: error: rule '{s}' references undefined rule '{s}'\n", .{ filename, v.rule_name, v.ref_name.? }) catch {};
has_errors = true;
},
.unused_rule => stderr.print("{s}: warning: rule '{s}' is defined but never referenced\n", .{ filename, v.rule_name }) catch {},
.unproductive_rule => {
stderr.print("{s}: error: rule '{s}' is unproductive (circular with no terminal escape)\n", .{ filename, v.rule_name }) catch {};
has_errors = true;
},
}
}
stderr.flush() catch {};
if (has_errors) std.process.exit(1);
Undefined references and unproductive rules are fatal: the grammar is broken. Duplicates and unused rules are warnings, since the grammar still works.
Given:
sentence = subject verb object
subject = noun
verb = "runs"
adj = "blue"
zpars sentence.abnf produces:
sentence.abnf: error: rule 'sentence' references undefined rule 'object'
sentence.abnf: error: rule 'subject' references undefined rule 'noun'
sentence.abnf: warning: rule 'adj' is defined but never referenced
Three issues, one run.
Takeaways
- Diagnostics as an accumulator. The same shape works at both layers: an
ArrayListthat grows as issues are found, with a tagged enum naming each kind. The caller decides policy. - Recovery doesn’t have to be smart. Skip to the next rule, drop whatever was being parsed, move on. No attempt to repair, no speculation about what the user meant.
- Two layers, two responsibilities. The parser’s job is “is this valid syntax.” The validator’s job is “is this a sane grammar.” When the same code tries to do both, both jobs get diluted; splitting them keeps each one short and the boundary obvious.
- Productivity is one fixed-point loop. Twenty lines, no graph library, no SCC algorithm. The naïve formulation is fine for any grammar a human will write.