I recently started zpars, a small parser playground in Zig. The first goal: scan ABNF grammars into tokens. ABNF (Augmented Backus-Naur Form) is the meta-language used to define the syntax of many internet standards — from HTTP to email headers — so it makes a solid first target.
This post walks through the scanner implementation.
Before scanning anything we need to define what a token looks like. Every token is just a tagged slice into the original source text:
const Token = @This();
tag: Tag,
/// Byte offset into source where this token's lexeme starts.
start: usize,
/// Length of the lexeme in bytes.
len: usize,
/// Line number (1-based) where this token appears.
line: usize,
No copying, no heap allocations per token — just an offset, a length, and a line number. The tag field carries the kind:
pub const Tag = enum {
// Single-character tokens
left_paren, // (
right_paren, // )
left_bracket, // [
right_bracket, // ]
slash, // /
star, // *
// One or two character tokens
equals, // =
equals_slash, // =/
// Literals
rulename, // e.g. "ALPHA", "my-rule"
number, // e.g. "3" in "3*5"
char_val, // "quoted string"
prose_val, // <prose description>
bin_val, // %b01010
dec_val, // %d65
hex_val, // %x41 or %x41-5A or %x41.42.43
// Structural
comment, // ; to end of line
newline, // CRLF or LF
// Special
eof,
invalid,
};
A helper method recovers the original text when you need it:
pub fn lexeme(self: Token, source: []const u8) []const u8 {
return source[self.start..self.start + self.len];
}
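To make the round trip concrete, here is a small test (hypothetical, not from the repo; it assumes std is imported in the same file):

test "lexeme recovers the original text" {
    const source = "rule = ALPHA";
    const tok = Token{ .tag = .rulename, .start = 0, .len = 4, .line = 1 };
    // The token stores no text of its own; slicing the source recovers it.
    try std.testing.expectEqualStrings("rule", tok.lexeme(source));
}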
The scanner struct holds everything needed for a single pass over the source:
const Scanner = @This();
/// The full source text being scanned.
source: []const u8,
/// Collected tokens.
tokens: std.ArrayList(Token),
/// Allocator for the token list.
allocator: std.mem.Allocator,
/// Start of the current lexeme being scanned.
start: usize,
/// Current position in source (next character to read).
current: usize,
/// Current line number (1-based).
line: usize,
start marks the beginning of the lexeme currently being scanned. current is the read head — always pointing at the next character to consume. The separation between the two is what lets us capture multi-character tokens without buffering anything.
Initialization is straightforward:
pub fn init(allocator: std.mem.Allocator, source: []const u8) Scanner {
return .{
.source = source,
.tokens = .empty,
.allocator = allocator,
.start = 0,
.current = 0,
.line = 1,
};
}
Note that std.ArrayList is unmanaged in current Zig: it starts as .empty, no allocation happens until the first append, and every allocating call (like append below) takes the allocator explicitly.
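main.zig, shown later, calls scanner.deinit(). The post doesn't reproduce that method, but for an unmanaged list a minimal version would just hand the token list back to the allocator:

/// Free the token list. (A sketch; the real method lives in the repo.)
pub fn deinit(self: *Scanner) void {
    self.tokens.deinit(self.allocator);
}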
scanTokens drives the whole process:
/// Scan the entire source and return the token list.
pub fn scanTokens(self: *Scanner) ![]const Token {
while (!self.isAtEnd()) {
// We are at the beginning of the next lexeme.
self.start = self.current;
try self.scanToken();
}
// Append final EOF token.
try self.tokens.append(self.allocator, .{
.tag = .eof,
.start = self.current,
.len = 0,
.line = self.line,
});
return self.tokens.items;
}
Each iteration resets start to current, scans one token, and loops. After the input is exhausted we append a sentinel eof token so consumers never have to worry about bounds.
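A quick hypothetical test of the sentinel behavior (using the deinit sketched above):

test "empty input yields a lone eof token" {
    var scanner = Scanner.init(std.testing.allocator, "");
    defer scanner.deinit();
    const tokens = try scanner.scanTokens();
    try std.testing.expectEqual(@as(usize, 1), tokens.len);
    try std.testing.expectEqual(Token.Tag.eof, tokens[0].tag);
}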
The core of the scanner is a big switch on the current character. Single-character tokens are the simplest case:
fn scanToken(self: *Scanner) !void {
const c = self.advance();
switch (c) {
'(' => try self.addToken(.left_paren),
')' => try self.addToken(.right_paren),
'[' => try self.addToken(.left_bracket),
']' => try self.addToken(.right_bracket),
'*' => try self.addToken(.star),
'/' => try self.addToken(.slash),
The = character needs a look-ahead — it could be = (definition) or =/ (incremental alternative):
'=' => try self.addToken(if (self.match('/')) .equals_slash else .equals),
match is a conditional advance: if the next character is /, consume it and return true; otherwise leave the cursor alone.
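To see the look-ahead in action (again a hypothetical test, not from the repo):

test "equals versus equals-slash" {
    var scanner = Scanner.init(std.testing.allocator, "= =/");
    defer scanner.deinit();
    const tokens = try scanner.scanTokens();
    try std.testing.expectEqual(Token.Tag.equals, tokens[0].tag);
    try std.testing.expectEqual(Token.Tag.equals_slash, tokens[1].tag);
}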
ABNF has two kinds of delimited literals: "quoted strings" and <prose values>. Both follow the same pattern — advance until the closing delimiter or end-of-input:
// String literals — "..."
'"' => {
while (self.peek() != '"' and !self.isAtEnd()) {
if (self.peek() == '\n') self.line += 1;
_ = self.advance();
}
if (self.isAtEnd()) {
try self.addToken(.invalid); // unterminated string
} else {
_ = self.advance(); // consume closing "
try self.addToken(.char_val);
}
},
If we hit the end before finding the closing quote, we emit .invalid.
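The invalid token still records where the problem started, so a later parser can point at it. A hedged check:

test "unterminated string becomes invalid" {
    var scanner = Scanner.init(std.testing.allocator, "\"no closing quote");
    defer scanner.deinit();
    const tokens = try scanner.scanTokens();
    try std.testing.expectEqual(Token.Tag.invalid, tokens[0].tag);
}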
ABNF numeric values start with % followed by a base indicator (b, d, or x), then digits in that base. They can also include . for concatenation (%x41.42.43) or - for ranges (%x41-5A):
// Numeric values — %b, %d, %x
'%' => {
const base = self.peek();
switch (base) {
'b' => {
_ = self.advance(); // consume 'b'
self.consumeDigits(isBit);
try self.addToken(.bin_val);
},
'd' => {
_ = self.advance(); // consume 'd'
self.consumeDigits(isDigit);
try self.addToken(.dec_val);
},
'x' => {
_ = self.advance(); // consume 'x'
self.consumeDigits(isHexDigit);
try self.addToken(.hex_val);
},
else => try self.addToken(.invalid), // bare % with no base
}
},
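Because start never moves while all this happens, the emitted token spans the whole value, percent sign and base letter included. A sketch of a test:

test "hex range scans as one token" {
    const source = "%x41-5A";
    var scanner = Scanner.init(std.testing.allocator, source);
    defer scanner.deinit();
    const tokens = try scanner.scanTokens();
    try std.testing.expectEqual(Token.Tag.hex_val, tokens[0].tag);
    try std.testing.expectEqualStrings("%x41-5A", tokens[0].lexeme(source));
}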
The consumeDigits helper handles the digit-dot-digit and digit-dash-digit patterns generically by accepting a function pointer for the digit predicate:
/// Consume digits for a numeric value, including "." and "-" continuations.
/// e.g. for hex: "41" or "41.42.43" or "41-5A"
fn consumeDigits(self: *Scanner, isValidDigit: *const fn (u8) bool) void {
// Consume first group of digits.
while (isValidDigit(self.peek())) _ = self.advance();
// Check for "." (concatenation) or "-" (range) continuation.
if (self.peek() == '.') {
// Dot-separated: %x41.42.43
while (self.peek() == '.') {
_ = self.advance(); // consume '.'
while (isValidDigit(self.peek())) _ = self.advance();
}
} else if (self.peek() == '-') {
// Range: %x41-5A
_ = self.advance(); // consume '-'
while (isValidDigit(self.peek())) _ = self.advance();
}
}
This is one of the places where Zig’s first-class function pointers feel natural. There is no generics ceremony — just pass the predicate directly.
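The predicates themselves aren't shown in the post. Judging by how they're used (isAlpha appears in the next section), they would be one-liners over ABNF's BIT, DIGIT, HEXDIG, and ALPHA character classes, roughly:

// Plausible sketches; the real definitions live in the repo.
fn isBit(c: u8) bool {
    return c == '0' or c == '1';
}

fn isDigit(c: u8) bool {
    return c >= '0' and c <= '9';
}

fn isHexDigit(c: u8) bool {
    // RFC 5234 defines HEXDIG as 0-9 / A-F; accepting lowercase
    // here is a permissive choice.
    return isDigit(c) or (c >= 'A' and c <= 'F') or (c >= 'a' and c <= 'f');
}

fn isAlpha(c: u8) bool {
    return (c >= 'A' and c <= 'Z') or (c >= 'a' and c <= 'z');
}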
The else branch handles identifiers (rulenames), bare numbers for repetition operators, and everything else:
else => {
if (isAlpha(c)) {
// Rulename: ALPHA *(ALPHA / DIGIT / "-")
while (isAlpha(self.peek()) or isDigit(self.peek()) or self.peek() == '-') {
_ = self.advance();
}
try self.addToken(.rulename);
} else if (isDigit(c)) {
// Number: 1*DIGIT (used in repetition)
while (isDigit(self.peek())) _ = self.advance();
try self.addToken(.number);
} else {
try self.addToken(.invalid);
}
},
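A hypothetical test for the rulename branch:

test "rulename with digits and dashes" {
    var scanner = Scanner.init(std.testing.allocator, "rule-1");
    defer scanner.deinit();
    const tokens = try scanner.scanTokens();
    try std.testing.expectEqual(Token.Tag.rulename, tokens[0].tag);
    try std.testing.expectEqual(@as(usize, 6), tokens[0].len);
}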
Whitespace (spaces and tabs) is silently skipped. Newlines get their own token since they are significant in ABNF — they terminate rules:
// Whitespace — skip silently.
' ', '\t' => {},
// Newlines — emit token and bump line counter.
'\r' => {
_ = self.match('\n'); // consume LF after CR (CRLF)
self.line += 1;
try self.addToken(.newline);
},
'\n' => {
self.line += 1;
try self.addToken(.newline);
},
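And one for CRLF handling, where the token should cover both bytes:

test "CRLF is one newline token" {
    var scanner = Scanner.init(std.testing.allocator, "\r\n");
    defer scanner.deinit();
    const tokens = try scanner.scanTokens();
    try std.testing.expectEqual(Token.Tag.newline, tokens[0].tag);
    try std.testing.expectEqual(@as(usize, 2), tokens[0].len);
}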
The scanner relies on a handful of small helpers that operate on the character stream:
// === Primitive operations ===
/// Consume the current character and return it.
fn advance(self: *Scanner) u8 {
const c = self.source[self.current];
self.current += 1;
return c;
}
/// Look at the current character without consuming it.
/// Returns 0 if at end.
fn peek(self: *Scanner) u8 {
if (self.isAtEnd()) return 0;
return self.source[self.current];
}
/// Look one character ahead (past current). Returns 0 if at end.
fn peekNext(self: *Scanner) u8 {
if (self.current + 1 >= self.source.len) return 0;
return self.source[self.current + 1];
}
/// Conditional advance: if current char matches `expected`, consume it
/// and return true. Otherwise return false.
fn match(self: *Scanner, expected: u8) bool {
if (self.isAtEnd()) return false;
if (self.source[self.current] != expected) return false;
self.current += 1;
return true;
}
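Two helpers used on nearly every branch aren't shown above: addToken and isAtEnd. Given the struct's fields, plausible definitions look like this (the real ones live in the repo):

/// Append a token spanning source[start..current]. (A sketch,
/// matching how addToken is used throughout the scanner.)
fn addToken(self: *Scanner, tag: Token.Tag) !void {
    try self.tokens.append(self.allocator, .{
        .tag = tag,
        .start = self.start,
        .len = self.current - self.start,
        .line = self.line,
    });
}

/// True once the read head has passed the last byte.
fn isAtEnd(self: *Scanner) bool {
    return self.current >= self.source.len;
}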
peek returns 0 at end-of-input, which naturally falls through every character comparison without needing explicit end checks everywhere.
Finally, main.zig reads a file, runs the scanner, and prints each token:
pub fn main() !void {
var gpa: std.heap.GeneralPurposeAllocator(.{}) = .init;
defer _ = gpa.deinit();
const allocator = gpa.allocator();
const args = try std.process.argsAlloc(allocator);
defer std.process.argsFree(allocator, args);
if (args.len < 2) {
std.debug.print("usage: zpars <file.abnf>\n", .{});
std.process.exit(1);
}
const source = try std.fs.cwd().readFileAlloc(allocator, args[1], 1024 * 1024);
defer allocator.free(source);
var scanner = zpars.Scanner.init(allocator, source);
defer scanner.deinit();
const tokens = try scanner.scanTokens();
var stdout_buffer: [4096]u8 = undefined;
var stdout_writer = std.fs.File.stdout().writer(&stdout_buffer);
const stdout = &stdout_writer.interface;
for (tokens) |tok| {
try stdout.print("[{d}:{d: >3}] {s: <16} \"{s}\"\n", .{
tok.line,
tok.start,
@tagName(tok.tag),
tok.lexeme(source),
});
}
try stdout.flush();
}
Running it on ABNF’s own grammar definition (zig build run -- examples/rfc5234.abnf) produces output like:
[1:  0] rulename         "rulelist"
[1:  9] equals           "="
[1: 11] number           "1"
[1: 12] star             "*"
[1: 13] left_paren       "("
[1: 14] rulename         "rule"
...
Each line shows the line number, byte offset, token tag, and the actual lexeme from the source.
A few things stood out while writing this:
- std.ArrayList with explicit allocator. Zig's approach of passing the allocator explicitly makes it obvious where memory comes from and easy to swap allocators later.
- consumeDigits takes a plain function pointer and the compiler can still inline it.

The full source is at github.com/q-uint/zpars. Next step: building a parser on top of these tokens.