[we]blog

Writing a Simple Scanner in Zig

I recently started zpars, a small parser playground in Zig. The first goal: scan ABNF grammars into tokens. ABNF (Augmented Backus-Naur Form) is the meta-language used to define the syntax of many internet standards — from HTTP to email headers — so it makes a solid first target.

This post walks through the scanner implementation.

Tokens

Before scanning anything we need to define what a token looks like. Every token is just a tagged slice into the original source text:

const Token = @This();

tag: Tag,
/// Byte offset into source where this token's lexeme starts.
start: usize,
/// Length of the lexeme in bytes.
len: usize,
/// Line number (1-based) where this token appears.
line: usize,

No copying, no heap allocations per token — just an offset, a length, and a line number. The tag field carries the kind:

pub const Tag = enum {
    // Single-character tokens
    left_paren, // (
    right_paren, // )
    left_bracket, // [
    right_bracket, // ]
    slash, // /
    star, // *

    // One or two character tokens
    equals, // =
    equals_slash, // =/

    // Literals
    rulename, // e.g. "ALPHA", "my-rule"
    number, // e.g. "3" in "3*5"
    char_val, // "quoted string"
    prose_val, // <prose description>
    bin_val, // %b01010
    dec_val, // %d65
    hex_val, // %x41 or %x41-5A or %x41.42.43

    // Structural
    comment, // ; to end of line
    newline, // CRLF or LF

    // Special
    eof,
    invalid,
};

A helper method recovers the original text when you need it:

pub fn lexeme(self: Token, source: []const u8) []const u8 {
    return source[self.start..self.start + self.len];
}
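A quick test (hypothetical — assuming it lives in the same file as Token, with std imported) shows the round-trip:

```zig
test "lexeme recovers source text" {
    const source = "rule = ALPHA";
    // "ALPHA" starts at byte 7 and is 5 bytes long.
    const tok: Token = .{ .tag = .rulename, .start = 7, .len = 5, .line = 1 };
    try std.testing.expectEqualStrings("ALPHA", tok.lexeme(source));
}
```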

The Scanner

The scanner struct holds everything needed for a single pass over the source:

const Scanner = @This();

/// The full source text being scanned.
source: []const u8,
/// Collected tokens.
tokens: std.ArrayList(Token),
/// Allocator for the token list.
allocator: std.mem.Allocator,
/// Start of the current lexeme being scanned.
start: usize,
/// Current position in source (next character to read).
current: usize,
/// Current line number (1-based).
line: usize,

start marks the beginning of the lexeme currently being scanned. current is the read head — always pointing at the next character to consume. The separation between the two is what lets us capture multi-character tokens without buffering anything.

Initialization is straightforward:

pub fn init(allocator: std.mem.Allocator, source: []const u8) Scanner {
    return .{
        .source = source,
        .tokens = .empty,
        .allocator = allocator,
        .start = 0,
        .current = 0,
        .line = 1,
    };
}

Note that this is Zig's unmanaged std.ArrayList: it starts as .empty and takes the allocator on each call, so no allocation happens until the first append.
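Because the list is unmanaged, cleanup needs the allocator too. main (shown later) calls scanner.deinit(), which might be as small as this sketch:

```zig
/// Free the token list. Sketch — the actual project may differ.
pub fn deinit(self: *Scanner) void {
    self.tokens.deinit(self.allocator);
}
```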

The Main Loop

scanTokens drives the whole process:

/// Scan the entire source and return the token list.
pub fn scanTokens(self: *Scanner) ![]const Token {
    while (!self.isAtEnd()) {
        // We are at the beginning of the next lexeme.
        self.start = self.current;
        try self.scanToken();
    }

    // Append final EOF token.
    try self.tokens.append(self.allocator, .{
        .tag = .eof,
        .start = self.current,
        .len = 0,
        .line = self.line,
    });

    return self.tokens.items;
}

Each iteration resets start to current, scans one token, and loops. After the input is exhausted we append a sentinel eof token so consumers never have to worry about bounds.

Scanning Individual Tokens

The core of the scanner is a big switch on the current character. Single-character tokens are the simplest case:

fn scanToken(self: *Scanner) !void {
    const c = self.advance();
    switch (c) {
        '(' => try self.addToken(.left_paren),
        ')' => try self.addToken(.right_paren),
        '[' => try self.addToken(.left_bracket),
        ']' => try self.addToken(.right_bracket),
        '*' => try self.addToken(.star),
        '/' => try self.addToken(.slash),

The = character needs a look-ahead — it could be = (definition) or =/ (incremental alternative):

        '=' => try self.addToken(if (self.match('/')) .equals_slash else .equals),

match is a conditional advance: if the next character is /, consume it and return true; otherwise leave the cursor alone.

Strings and Prose

ABNF has two kinds of delimited literals: "quoted strings" and <prose values>. Both follow the same pattern — advance until the closing delimiter or end-of-input:

        // String literals — "..."
        '"' => {
            while (self.peek() != '"' and !self.isAtEnd()) {
                if (self.peek() == '\n') self.line += 1;
                _ = self.advance();
            }
            if (self.isAtEnd()) {
                try self.addToken(.invalid); // unterminated string
            } else {
                _ = self.advance(); // consume closing "
                try self.addToken(.char_val);
            }
        },

If we hit the end before finding the closing quote, we emit .invalid.
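The prose-value arm is not shown above, but since both literals follow the same pattern, a sketch assuming the same structure looks like:

```zig
        // Prose values — <...>
        '<' => {
            while (self.peek() != '>' and !self.isAtEnd()) {
                if (self.peek() == '\n') self.line += 1;
                _ = self.advance();
            }
            if (self.isAtEnd()) {
                try self.addToken(.invalid); // unterminated prose
            } else {
                _ = self.advance(); // consume closing >
                try self.addToken(.prose_val);
            }
        },
```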

Numeric Values

ABNF numeric values start with % followed by a base indicator (b, d, or x), then digits in that base. They can also include . for concatenation (%x41.42.43) or - for ranges (%x41-5A):

        // Numeric values — %b, %d, %x
        '%' => {
            const base = self.peek();
            switch (base) {
                'b' => {
                    _ = self.advance(); // consume 'b'
                    self.consumeDigits(isBit);
                    try self.addToken(.bin_val);
                },
                'd' => {
                    _ = self.advance(); // consume 'd'
                    self.consumeDigits(isDigit);
                    try self.addToken(.dec_val);
                },
                'x' => {
                    _ = self.advance(); // consume 'x'
                    self.consumeDigits(isHexDigit);
                    try self.addToken(.hex_val);
                },
                else => try self.addToken(.invalid), // bare % with no base
            }
        },

The consumeDigits helper handles the digit-dot-digit and digit-dash-digit patterns generically by accepting a function pointer for the digit predicate:

/// Consume digits for a numeric value, including "." and "-" continuations.
/// e.g. for hex: "41" or "41.42.43" or "41-5A"
fn consumeDigits(self: *Scanner, isValidDigit: *const fn (u8) bool) void {
    // Consume first group of digits.
    while (isValidDigit(self.peek())) _ = self.advance();

    // Check for "." (concatenation) or "-" (range) continuation.
    if (self.peek() == '.') {
        // Dot-separated: %x41.42.43
        while (self.peek() == '.') {
            _ = self.advance(); // consume '.'
            while (isValidDigit(self.peek())) _ = self.advance();
        }
    } else if (self.peek() == '-') {
        // Range: %x41-5A
        _ = self.advance(); // consume '-'
        while (isValidDigit(self.peek())) _ = self.advance();
    }
}

This is one of the places where Zig’s first-class function pointers feel natural. There is no generics ceremony — just pass the predicate directly.
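The predicates themselves are not shown in the snippets above. A plausible definition, matching ABNF's core rules for each base:

```zig
fn isAlpha(c: u8) bool {
    return (c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z');
}

fn isDigit(c: u8) bool {
    return c >= '0' and c <= '9';
}

fn isHexDigit(c: u8) bool {
    return isDigit(c) or (c >= 'a' and c <= 'f') or (c >= 'A' and c <= 'F');
}

fn isBit(c: u8) bool {
    return c == '0' or c == '1';
}
```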

Rulenames, Numbers, and Whitespace

The else branch handles identifiers (rulenames), bare numbers for repetition operators, and everything else:

        else => {
            if (isAlpha(c)) {
                // Rulename: ALPHA *(ALPHA / DIGIT / "-")
                while (isAlpha(self.peek()) or isDigit(self.peek()) or self.peek() == '-') {
                    _ = self.advance();
                }
                try self.addToken(.rulename);
            } else if (isDigit(c)) {
                // Number: 1*DIGIT (used in repetition)
                while (isDigit(self.peek())) _ = self.advance();
                try self.addToken(.number);
            } else {
                try self.addToken(.invalid);
            }
        },

Whitespace (spaces and tabs) is silently skipped. Newlines get their own token since they are significant in ABNF — they terminate rules. (In the actual switch these arms sit before the catch-all else branch, which Zig requires to come last.)

        // Whitespace — skip silently.
        ' ', '\t' => {},

        // Newlines — emit token and bump line counter.
        '\r' => {
            _ = self.match('\n'); // consume LF after CR (CRLF)
            self.line += 1;
            try self.addToken(.newline);
        },
        '\n' => {
            self.line += 1;
            try self.addToken(.newline);
        },

Primitive Operations

The scanner relies on a handful of small helpers that operate on the character stream:

// === Primitive operations ===

/// Consume the current character and return it.
fn advance(self: *Scanner) u8 {
    const c = self.source[self.current];
    self.current += 1;
    return c;
}

/// Look at the current character without consuming it.
/// Returns 0 if at end.
fn peek(self: *Scanner) u8 {
    if (self.isAtEnd()) return 0;
    return self.source[self.current];
}

/// Look one character ahead (past current). Returns 0 if at end.
fn peekNext(self: *Scanner) u8 {
    if (self.current + 1 >= self.source.len) return 0;
    return self.source[self.current + 1];
}

/// Conditional advance: if current char matches `expected`, consume it
/// and return true. Otherwise return false.
fn match(self: *Scanner, expected: u8) bool {
    if (self.isAtEnd()) return false;
    if (self.source[self.current] != expected) return false;
    self.current += 1;
    return true;
}

peek returns 0 at end-of-input, a sentinel that fails every character comparison, so most call sites get their end check for free.
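Two helpers used throughout but not shown — isAtEnd and addToken — round out the set. A sketch consistent with the fields defined above:

```zig
/// True once the read head has passed the last byte of source.
fn isAtEnd(self: *Scanner) bool {
    return self.current >= self.source.len;
}

/// Append a token spanning [start, current) with the given tag.
fn addToken(self: *Scanner, tag: Token.Tag) !void {
    try self.tokens.append(self.allocator, .{
        .tag = tag,
        .start = self.start,
        .len = self.current - self.start,
        .line = self.line,
    });
}
```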

Wiring It Up

The entry point in main.zig reads a file, runs the scanner, and prints each token:

pub fn main() !void {
    var gpa: std.heap.GeneralPurposeAllocator(.{}) = .init;
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const args = try std.process.argsAlloc(allocator);
    defer std.process.argsFree(allocator, args);

    if (args.len < 2) {
        std.debug.print("usage: zpars <file.abnf>\n", .{});
        std.process.exit(1);
    }

    const source = try std.fs.cwd().readFileAlloc(allocator, args[1], 1024 * 1024);
    defer allocator.free(source);

    var scanner = zpars.Scanner.init(allocator, source);
    defer scanner.deinit();

    const tokens = try scanner.scanTokens();

    var stdout_buffer: [4096]u8 = undefined;
    var stdout_writer = std.fs.File.stdout().writer(&stdout_buffer);
    const stdout = &stdout_writer.interface;

    for (tokens) |tok| {
        try stdout.print("[{d}:{d: >3}] {s: <16} \"{s}\"\n", .{
            tok.line,
            tok.start,
            @tagName(tok.tag),
            tok.lexeme(source),
        });
    }
    try stdout.flush();
}

Running it on ABNF’s own grammar definition (zig build run -- examples/rfc5234.abnf) produces output like:

[1:  0] rulename         "rulelist"
[1:  9] equals           "="
[1: 11] number           "1"
[1: 12] star             "*"
[1: 13] left_paren       "("
[1: 14] rulename         "rule"
...

Each line shows the line number, byte offset, token tag, and the actual lexeme from the source.

Takeaways

A few things stood out while writing this:

- Tokens as tagged slices (offset plus length) keep the scanner allocation-free per token; the only heap usage is the token list itself.
- The unmanaged std.ArrayList starting as .empty means a scanner costs nothing until the first append.
- First-class function pointers make consumeDigits generic over its digit predicate with no comptime ceremony.

The full source is at github.com/q-uint/zpars. Next step: building a parser on top of these tokens.