[we]blog

Writing a Simple Scanner in Zig

I recently started zpars, a small parser playground in Zig. The first goal: scan ABNF grammars into tokens. ABNF (Augmented Backus-Naur Form) is the meta-language used to define the syntax of many internet standards — from HTTP to email headers — so it makes a solid first target.

This post walks through the scanner implementation.

Tokens

Before scanning anything we need to define what a token looks like. Every token is just a tagged slice into the original source text:

const Token = @This();

tag: Tag,
/// Byte offset into source where this token's lexeme starts.
start: usize,
/// Length of the lexeme in bytes.
len: usize,
/// Line number (1-based) where this token appears.
line: usize,

No copying, no heap allocations per token — just an offset, a length, and a line number. The tag field carries the kind:

pub const Tag = enum {
    // Single-character tokens
    left_paren, // (
    right_paren, // )
    left_bracket, // [
    right_bracket, // ]
    slash, // /
    star, // *

    // One or two character tokens
    equals, // =
    equals_slash, // =/

    // Literals
    rulename, // e.g. "ALPHA", "my-rule"
    number, // e.g. "3" in "3*5"
    char_val, // "quoted string"
    prose_val, // <prose description>
    bin_val, // %b01010
    dec_val, // %d65
    hex_val, // %x41 or %x41-5A or %x41.42.43

    // Structural
    comment, // ; to end of line
    newline, // CRLF or LF

    // Special
    eof,
    invalid,
};

A helper method recovers the original text when you need it:

pub fn lexeme(self: Token, source: []const u8) []const u8 {
    return source[self.start..self.start + self.len];
}
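A quick test (hypothetical — assuming it lives in the same file as Token, with std imported) shows the round-trip:

```zig
test "lexeme recovers source text" {
    const source = "rule = ALPHA";
    // "ALPHA" starts at byte 7 and is 5 bytes long.
    const tok: Token = .{ .tag = .rulename, .start = 7, .len = 5, .line = 1 };
    try std.testing.expectEqualStrings("ALPHA", tok.lexeme(source));
}
```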

The Scanner

The scanner struct holds everything needed for a single pass over the source:

const Scanner = @This();

/// The full source text being scanned.
source: []const u8,
/// Collected tokens.
tokens: std.ArrayList(Token),
/// Allocator for the token list.
allocator: std.mem.Allocator,
/// Start of the current lexeme being scanned.
start: usize,
/// Current position in source (next character to read).
current: usize,
/// Current line number (1-based).
line: usize,

start marks the beginning of the lexeme currently being scanned. current is the read head — always pointing at the next character to consume. The separation between the two is what lets us capture multi-character tokens without buffering anything.

Initialization is straightforward:

pub fn init(allocator: std.mem.Allocator, source: []const u8) Scanner {
    return .{
        .source = source,
        .tokens = .empty,
        .allocator = allocator,
        .start = 0,
        .current = 0,
        .line = 1,
    };
}

Note that this is Zig's unmanaged std.ArrayList: it starts as .empty and takes the allocator on each call, so no allocation happens until the first append.
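Because the list is unmanaged, cleanup needs the allocator too. main (shown later) calls scanner.deinit(), which might be as small as this sketch:

```zig
/// Free the token list. Sketch — the actual project may differ.
pub fn deinit(self: *Scanner) void {
    self.tokens.deinit(self.allocator);
}
```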

The Main Loop

scanTokens drives the whole process:

/// Scan the entire source and return the token list.
pub fn scanTokens(self: *Scanner) ![]const Token {
    while (!self.isAtEnd()) {
        // We are at the beginning of the next lexeme.
        self.start = self.current;
        try self.scanToken();
    }

    // Append final EOF token.
    try self.tokens.append(self.allocator, .{
        .tag = .eof,
        .start = self.current,
        .len = 0,
        .line = self.line,
    });

    return self.tokens.items;
}

Each iteration resets start to current, scans one token, and loops. After the input is exhausted we append a sentinel eof token so consumers never have to worry about bounds.

Scanning Individual Tokens

The core of the scanner is a big switch on the current character. Single-character tokens are the simplest case:

fn scanToken(self: *Scanner) !void {
    const c = self.advance();
    switch (c) {
        '(' => try self.addToken(.left_paren),
        ')' => try self.addToken(.right_paren),
        '[' => try self.addToken(.left_bracket),
        ']' => try self.addToken(.right_bracket),
        '*' => try self.addToken(.star),
        '/' => try self.addToken(.slash),

The = character needs a look-ahead — it could be = (definition) or =/ (incremental alternative):

        '=' => try self.addToken(if (self.match('/')) .equals_slash else .equals),

match is a conditional advance: if the next character is /, consume it and return true; otherwise leave the cursor alone.

Strings and Prose

ABNF has two kinds of delimited literals: "quoted strings" and <prose values>. Both follow the same pattern — advance until the closing delimiter or end-of-input:

        // String literals — "..."
        '"' => {
            while (self.peek() != '"' and !self.isAtEnd()) {
                if (self.peek() == '\n') self.line += 1;
                _ = self.advance();
            }
            if (self.isAtEnd()) {
                try self.addToken(.invalid); // unterminated string
            } else {
                _ = self.advance(); // consume closing "
                try self.addToken(.char_val);
            }
        },

If we hit the end before finding the closing quote, we emit .invalid.
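The prose-value arm is not shown above, but since both literals follow the same pattern, a sketch assuming the same structure looks like:

```zig
        // Prose values — <...>
        '<' => {
            while (self.peek() != '>' and !self.isAtEnd()) {
                if (self.peek() == '\n') self.line += 1;
                _ = self.advance();
            }
            if (self.isAtEnd()) {
                try self.addToken(.invalid); // unterminated prose
            } else {
                _ = self.advance(); // consume closing >
                try self.addToken(.prose_val);
            }
        },
```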

Numeric Values

ABNF numeric values start with % followed by a base indicator (b, d, or x), then digits in that base. They can also include . for concatenation (%x41.42.43) or - for ranges (%x41-5A):

        // Numeric values — %b, %d, %x
        '%' => {
            const base = self.peek();
            switch (base) {
                'b' => {
                    _ = self.advance(); // consume 'b'
                    self.consumeDigits(isBit);
                    try self.addToken(.bin_val);
                },
                'd' => {
                    _ = self.advance(); // consume 'd'
                    self.consumeDigits(isDigit);
                    try self.addToken(.dec_val);
                },
                'x' => {
                    _ = self.advance(); // consume 'x'
                    self.consumeDigits(isHexDigit);
                    try self.addToken(.hex_val);
                },
                else => try self.addToken(.invalid), // bare % with no base
            }
        },

The consumeDigits helper handles the digit-dot-digit and digit-dash-digit patterns generically by accepting a function pointer for the digit predicate:

/// Consume digits for a numeric value, including "." and "-" continuations.
/// e.g. for hex: "41" or "41.42.43" or "41-5A"
fn consumeDigits(self: *Scanner, isValidDigit: *const fn (u8) bool) void {
    // Consume first group of digits.
    while (isValidDigit(self.peek())) _ = self.advance();

    // Check for "." (concatenation) or "-" (range) continuation.
    if (self.peek() == '.') {
        // Dot-separated: %x41.42.43
        while (self.peek() == '.') {
            _ = self.advance(); // consume '.'
            while (isValidDigit(self.peek())) _ = self.advance();
        }
    } else if (self.peek() == '-') {
        // Range: %x41-5A
        _ = self.advance(); // consume '-'
        while (isValidDigit(self.peek())) _ = self.advance();
    }
}

This is one of the places where Zig’s first-class function pointers feel natural. There is no generics ceremony — just pass the predicate directly.
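The predicates themselves are not shown in the snippets above. A plausible definition, matching ABNF's core rules for each base:

```zig
fn isAlpha(c: u8) bool {
    return (c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z');
}

fn isDigit(c: u8) bool {
    return c >= '0' and c <= '9';
}

fn isHexDigit(c: u8) bool {
    return isDigit(c) or (c >= 'a' and c <= 'f') or (c >= 'A' and c <= 'F');
}

fn isBit(c: u8) bool {
    return c == '0' or c == '1';
}
```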

Rulenames, Numbers, and Whitespace

The else branch handles identifiers (rulenames), bare numbers for repetition operators, and everything else:

        else => {
            if (isAlpha(c)) {
                // Rulename: ALPHA *(ALPHA / DIGIT / "-")
                while (isAlpha(self.peek()) or isDigit(self.peek()) or self.peek() == '-') {
                    _ = self.advance();
                }
                try self.addToken(.rulename);
            } else if (isDigit(c)) {
                // Number: 1*DIGIT (used in repetition)
                while (isDigit(self.peek())) _ = self.advance();
                try self.addToken(.number);
            } else {
                try self.addToken(.invalid);
            }
        },

Whitespace (spaces and tabs) is silently skipped. Newlines get their own token since they are significant in ABNF — they terminate rules. (In the actual switch these arms sit before the catch-all else branch, which Zig requires to come last.)

        // Whitespace — skip silently.
        ' ', '\t' => {},

        // Newlines — emit token and bump line counter.
        '\r' => {
            _ = self.match('\n'); // consume LF after CR (CRLF)
            self.line += 1;
            try self.addToken(.newline);
        },
        '\n' => {
            self.line += 1;
            try self.addToken(.newline);
        },

Primitive Operations

The scanner relies on a handful of small helpers that operate on the character stream:

// === Primitive operations ===

/// Consume the current character and return it.
fn advance(self: *Scanner) u8 {
    const c = self.source[self.current];
    self.current += 1;
    return c;
}

/// Look at the current character without consuming it.
/// Returns 0 if at end.
fn peek(self: *Scanner) u8 {
    if (self.isAtEnd()) return 0;
    return self.source[self.current];
}

/// Look one character ahead (past current). Returns 0 if at end.
fn peekNext(self: *Scanner) u8 {
    if (self.current + 1 >= self.source.len) return 0;
    return self.source[self.current + 1];
}

/// Conditional advance: if current char matches `expected`, consume it
/// and return true. Otherwise return false.
fn match(self: *Scanner, expected: u8) bool {
    if (self.isAtEnd()) return false;
    if (self.source[self.current] != expected) return false;
    self.current += 1;
    return true;
}

peek returns 0 at end-of-input, a sentinel that fails every character comparison, so most call sites get their end check for free.
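Two helpers used throughout but not shown — isAtEnd and addToken — round out the set. A sketch consistent with the fields defined above:

```zig
/// True once the read head has passed the last byte of source.
fn isAtEnd(self: *Scanner) bool {
    return self.current >= self.source.len;
}

/// Append a token spanning [start, current) with the given tag.
fn addToken(self: *Scanner, tag: Token.Tag) !void {
    try self.tokens.append(self.allocator, .{
        .tag = tag,
        .start = self.start,
        .len = self.current - self.start,
        .line = self.line,
    });
}
```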

Wiring It Up

The entry point in main.zig reads a file, runs the scanner, and prints each token:

pub fn main() !void {
    var gpa: std.heap.GeneralPurposeAllocator(.{}) = .init;
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const args = try std.process.argsAlloc(allocator);
    defer std.process.argsFree(allocator, args);

    if (args.len < 2) {
        std.debug.print("usage: zpars <file.abnf>\n", .{});
        std.process.exit(1);
    }

    const source = try std.fs.cwd().readFileAlloc(allocator, args[1], 1024 * 1024);
    defer allocator.free(source);

    var scanner = zpars.Scanner.init(allocator, source);
    defer scanner.deinit();

    const tokens = try scanner.scanTokens();

    var stdout_buffer: [4096]u8 = undefined;
    var stdout_writer = std.fs.File.stdout().writer(&stdout_buffer);
    const stdout = &stdout_writer.interface;

    for (tokens) |tok| {
        try stdout.print("[{d}:{d: >3}] {s: <16} \"{s}\"\n", .{
            tok.line,
            tok.start,
            @tagName(tok.tag),
            tok.lexeme(source),
        });
    }
    try stdout.flush();
}

Running it on ABNF’s own grammar definition (zig build run -- examples/rfc5234.abnf) produces output like:

[1:  0] rulename         "rulelist"
[1:  9] equals           "="
[1: 11] number           "1"
[1: 12] star             "*"
[1: 13] left_paren       "("
[1: 14] rulename         "rule"
...

Each line shows the line number, byte offset, token tag, and the actual lexeme from the source.

Takeaways

A few things stood out while writing this:

- Tokens as tagged slices (offset plus length) keep the scanner allocation-free per token; the only heap usage is the token list itself.
- The unmanaged std.ArrayList starting as .empty means a scanner costs nothing until the first append.
- First-class function pointers make consumeDigits generic over its digit predicate with no comptime ceremony.

The full source is at github.com/q-uint/zpars. Next step: building a parser on top of these tokens.