A Native AMDGCN Backend for Zig

This is the first post in a series I have no business starting.

How I got here

I’ve been poking at Zig’s native SPIR-V¹ backend. Not because I had a plan, but because that’s how I learn anything: I open a door, see a corridor, and forget what I came for. SPIR-V led to “how does a GPU actually receive code,” which led to “what does the machine code even look like,” which led to me staring at an AMD ISA² manual at an hour I’m not proud of.

Eventually I stopped and asked the obvious question:

Could Zig emit code for my GPU directly, no LLVM, no SPIR-V, no intermediary, the way its self-hosted backends already do for x86_64 and aarch64?

And then, against my better judgement, I decided to find out.

Why AMDGCN

Two reasons, one principled and one entirely accidental.

The principled one: AMD is the open option. The instruction set is documented in public whitepapers, the runtime (ROCm³) is open source, and the whole code-object format is specified rather than reverse-engineered. NVIDIA’s stack is a wall of closed PTX/SASS⁴ and a driver that guards the real ISA. If I’m going to hand-assemble GPU machine code to learn how it works, I’d like the manual to actually exist.

The accidental one: I happen to own an all-AMD desktop. I built it a while back without any of this in mind, and it turns out to be exactly the machine you want for this kind of masochism:

GPU: Radeon RX 7900 (Navi 31, gfx1100, RDNA3, wave32)⁵
CPU: Ryzen 7 7800X3D (8 cores / 16 threads)
RAM: 64 GB
iGPU: Raphael, riding along on the Ryzen

So the target picks itself: amdgcn-amd-amdhsa⁶, gfx1100. One GPU generation, one code-object version.

Where things actually stand

I want to be honest about how little I’ve done, because the gap between “I started” and “it works” is the whole point of this series.

The compiler work I’ve actually landed so far was all on the SPIR-V side. While poking at Zig’s native SPIR-V backend I fixed a few very, very minor compiler errors, the kind that barely deserve a commit message, and got its test happy. That’s the rabbit hole that eventually spat me out in front of the AMDGCN question. None of it was AMDGCN work.

On the AMDGCN side the page is nearly blank. amdgcn is already a real target architecture in the Zig compiler: it lives in the target tables, and test/llvm_targets.zig already lists amdgcn-amd-amdhsa. So the scaffolding for talking about the target exists. What doesn’t exist, at all, is a native backend that turns Zig’s IR⁷ into AMD machine code without leaning on LLVM. That’s the thing I’m setting out to build.

This is going to be hard, and that’s fine

I have no illusions here. A native GPU backend means understanding the ISA encoding, the 64-byte kernel descriptor, the metadata the loader reads, and the dispatch machinery: an entire stack that normally hides behind a compiler and a driver. It is genuinely difficult, and I will get a lot of it wrong in public.

But this was never about shipping a production backend. It’s a learning journey. If the destination turns out to be unreachable, I’ll have learned how GPUs boot code along the way, and that’s already worth the trip.

So I did the scariest part first: prove the AMD code-object format and loader by hand, with the compiler completely out of the loop. The smallest possible signal is an empty kernel that dispatches on real hardware without faulting.

Milestone A: an empty kernel that dispatches

Success here is nothing more than “it ran”: one workgroup, one lane, a completion signal that fires, no queue error and no page fault. That sounds trivial, but reaching it means the code-object format, the kernel descriptor, the loader, and the dispatch packet are all correct. That is most of the risk retired at once.

A reference to copy

Before writing anything by hand, I built a known-good template with clang:

clang --target=amdgcn-amd-amdhsa -mcpu=gfx1100 -mcode-object-version=5 -c reference.c

Dumped with llvm-objdump -d and llvm-readobj --notes, it is not the deliverable but a correct example to diff against. The descriptor and the metadata note are fiddly enough that you want something known-good beside you rather than building blind.

The three pieces of a code object

A loadable AMD code object is an ELF⁸ for amdgcn-amd-amdhsa with three things the loader cares about: the code, a 64-byte kernel descriptor⁹, and a metadata note.

The code is one instruction. s_endpgm ends the wavefront, and it encodes to the four bytes 00 00 b0 bf on gfx1100. The descriptor is a bit-packed structure the assembler synthesizes from an .amdhsa_kernel block; we ask for the bare minimum:

// The 64-byte kernel descriptor. The assembler synthesizes the bit-packed
// COMPUTE_PGM_RSRC words from these fields; we ask for the bare minimum.
	.rodata
	.p2align	6              // descriptor must be 64-byte aligned
	.amdhsa_kernel empty_kernel
		.amdhsa_next_free_vgpr 1
		.amdhsa_next_free_sgpr 1
		.amdhsa_wavefront_size32 1
	.end_amdhsa_kernel

s_endpgm.s:18-26

That block also creates the symbol empty_kernel.kd, which the loader hands back as the kernel object at dispatch time. The third piece is a msgpack¹⁰ NT_AMDGPU_METADATA note, the loader’s directory: it names the kernel, points at its .kd symbol, and declares the arguments and wavefront size.

// The msgpack metadata note the loader reads to find the kernel and its symbol.
	.amdgpu_metadata
---
amdhsa.version:
  - 1
  - 2
amdhsa.target: amdgcn-amd-amdhsa--gfx1100
amdhsa.kernels:
  - .name:            empty_kernel
    .symbol:          empty_kernel.kd
    .kernarg_segment_size: 0
    .kernarg_segment_align: 4
    .group_segment_fixed_size: 0
    .private_segment_fixed_size: 0
    .sgpr_count:      0
    .vgpr_count:      0
    .max_flat_workgroup_size: 1024
    .wavefront_size:  32
    .args:            []
...
	.end_amdgpu_metadata

s_endpgm.s:28-48

Whose loader, whose dispatch?

To put a kernel on the GPU you need some runtime to talk to the kernel driver. That is not a compromise of the no-LLVM goal, which is about codegen. The question is how much of the loading and dispatch we do ourselves. There’s a spectrum. HIP¹¹ hides everything behind hipModuleLaunchKernel. Raw KFD¹² ioctls do it all by hand, no ROCm at all. In between sits HSA¹³, the low-level runtime HIP is built on.

I chose HSA. Its loader parses the descriptor and metadata we wrote, so those get validated by something other than ourselves, but we still build the dispatch packet and ring the doorbell ourselves. That way “it ran” reflects our own understanding, not HIP’s.

Building the dispatch packet by hand

The host side is a small Zig program that talks to libhsa-runtime64. It loads the code object, looks up empty_kernel.kd, fills in an AQL¹⁴ kernel-dispatch packet field by field, publishes the header with a release fence so the packet processor only ever sees a fully-written packet, and rings the doorbell¹⁵:

    // Reserve a slot and build the AQL kernel-dispatch packet BY HAND.
    const index = c.hsa_queue_add_write_index_relaxed(q, 1);
    const mask: u64 = q.size - 1;
    const packets: [*]c.hsa_kernel_dispatch_packet_t = @ptrCast(@alignCast(q.base_address));
    const pkt = &packets[index & mask];

    pkt.* = std.mem.zeroes(c.hsa_kernel_dispatch_packet_t);
    pkt.unnamed_0.unnamed_0.setup = @as(u16, 1) << @as(u4, @intCast(c.HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS)); // 1-D grid
    pkt.workgroup_size_x = 1;
    pkt.workgroup_size_y = 1;
    pkt.workgroup_size_z = 1;
    pkt.grid_size_x = 1;
    pkt.grid_size_y = 1;
    pkt.grid_size_z = 1;
    pkt.group_segment_size = group_size;
    pkt.private_segment_size = private_size;
    pkt.kernel_object = kernel_object;
    pkt.kernarg_address = null; // no kernel arguments
    pkt.completion_signal = done;

    // Publish the header atomically (release) so the packet processor only ever
    // observes a fully-written packet.
    const header: u16 =
        (@as(u16, @intCast(c.HSA_PACKET_TYPE_KERNEL_DISPATCH)) << PH_TYPE) |
        (@as(u16, 1) << PH_BARRIER) |
        (@as(u16, @intCast(c.HSA_FENCE_SCOPE_SYSTEM)) << PH_ACQ) |
        (@as(u16, @intCast(c.HSA_FENCE_SCOPE_SYSTEM)) << PH_REL);
    @atomicStore(u16, @as(*u16, @ptrCast(pkt)), header, .release);

    // Ring the doorbell: tell the packet processor there's work at `index`.
    c.hsa_queue_store_write_index_relaxed(q, index + 1);
    c.hsa_signal_store_relaxed(q.doorbell_signal, @intCast(index));
    std.debug.print("doorbell rung, waiting for completion...\n", .{});

    // Block until the kernel signals done.
    const v = c.hsa_signal_wait_scacquire(done, COND_EQ, 0, std.math.maxInt(u64), WAIT_BLOCKED);
    std.debug.print("completion signal = {d}\n", .{v});
    std.debug.print("DISPATCHED empty_kernel: 1 workgroup, 1 lane, no fault. it ran.\n", .{});

dispatch.zig:108-145

Detours

rocminfo refused to run, printing “ROCk module is NOT loaded”. That turned out to be cosmetic: it checks /sys/module/amdgpu/initstate, which does not exist on this kernel. The HSA runtime itself initializes fine and sees gfx1100, so the check is a quirk, not a real problem.

It ran

GPU agent: gfx1100
executable frozen (loader accepted our descriptor + metadata)
doorbell rung, waiting for completion...
completion signal = 0
DISPATCHED empty_kernel: 1 workgroup, 1 lane, no fault. it ran.

The completion signal reaching zero is the whole ballgame. With the compiler nowhere in sight, we proved the code-object format, the kernel descriptor, the HSA loader, and a dispatch packet we built ourselves. Now we know exactly what the backend has to produce.

Next, the compiler finally enters the picture: the first job is to emit exactly this code object, but from inside Zig’s backend, driven by the compiler’s IR instead of hand-written assembly.