Zig64

Baby Steps

Yesterday I published the source code (and a demo video) for what's, to my knowledge, the first ever Nintendo 64 ROM written in Zig (and also the first N64 ROM built using Zig's build system). Naturally, it doesn't do much yet; it only prints a hello-world-equivalent debug message (either over a SummerCart 64's USB port or in the debug output of an IS-Viewer-compatible emulator), and writes some test bytes to DMEM. Still, we all gotta start somewhere!

In this post I aim to cover some of the snags I hit, how I ended up overcoming them, and some improvement points and next steps as I continue to iterate on this.

Prior art

Like I said, as far as I know this is uncharted territory; I've searched online for existing examples of Zig running on an N64 and found none. From what I can find, there were (prior to yesterday) exactly four (maybe five) ways of going about writing N64 software:

The fact that Rust was indeed an option made it clear that LLVM (and therefore Zig) ain't out of the running. That said, while there's certainly an exciting feeling about being some sort of trailblazer or whatever, it's also kinda terrifying: it means a lot of my troubleshooting process entailed translating from these other programming languages, studying the documentation compiled over the years on the N64 hardware, and in a lot of cases making (un)educated guesses and seeing what worked and what didn't.

Still, even if the intersection of Zig and the N64 is uncharted territory, there are plenty of resources (in addition to the above) that were helpful in figuring things out:

Windfalls

In a lot of ways, targeting the N64 is surprisingly easy with Zig. My first pass at this (as you can see in the src.borked/ folder) ultimately didn't pan out (yet), but even with that utter failure of an example, Zig's build system was readily able to output something vaguely resembling an N64 ROM. Generating code was really as simple as setting the right target query in build.zig, like so:

const n64_target_query = std.Target.Query{
    .cpu_arch = std.Target.Cpu.Arch.mips,
    .cpu_model = .{.explicit = &std.Target.mips.cpu.mips3},
    .os_tag = std.Target.Os.Tag.freestanding,
    .abi = std.Target.Abi.none,
};

// ...

const elf = b.addExecutable(.{
    .name = "zig64.elf",
    .root_source_file = b.path("src/main.zig"),
    .target = b.resolveTargetQuery(n64_target_query),
    .optimize = .Debug,
});
elf.setLinkerScript(b.path("src/n64.ld"));
b.installArtifact(elf);

With the right linker script directives it's even possible to generate the header every ROM has and shove it in front of the output binary (when it's exported into "bin" format). If it weren't for the N64's pesky need for IPL3 bootcode in every ROM, it would've been smooth sailing. Unfortunately, trying to implement my own IPL3 right off the bat was biting off way more than I could chew, so I ended up pivoting away from that for now.

But that brings me to another surprisingly-easy thing to do: building C executables and using those same executables in later build steps. Biting the bullet and "borrowing" libdragon's IPL3 boot infrastructure required only two things:

n64tool is a single-C-source-file program, so building it is straightforward:

const n64tool = b.addExecutable(.{
    .name = "n64tool",
    .target = b.standardTargetOptions(.{}),
    .optimize = b.standardOptimizeOption(.{}),
});
n64tool.linkLibC();
n64tool.addCSourceFiles(.{
    .files = &.{"vendor/n64tool/n64tool.c"},
    .flags = &.{
        "--std=c23", // Needed for typeof()
        "-Wall",
        "-Werror",
        "-Wno-unused-result",
        "-Wno-error=unknown-pragmas",
        "-Wno-sign-compare",
    },
});
b.installArtifact(n64tool);

The only snag, as you might've noticed, was that n64tool.c uses typeof(), which (from what I gather) was originally a GCC extension. It's part of C23 now, though, so adding --std=c23 to the compiler flags was enough to keep zig cc from choking on it.

With n64tool built, all that's left is to actually run it:

const makerom = b.addRunArtifact(n64tool);
makerom.addArgs(&.{
    "--title",
    "Zig N64 Demo",
    "--header",
    "vendor/ipl3_dev.z64",
});
makerom.addArg("--output");
const rom = makerom.addOutputFileArg("zig64.z64");
makerom.addArgs(&.{"--align", "256"});
makerom.addArtifactArg(elf);
b.getInstallStep().dependOn(&b.addInstallFileWithDir(rom, .prefix, "zig64.z64").step);

Easy peasy lemon squeezy! ...right?

Snags

To start, I'd like to tell you a story about this thing called the "Reality Coprocessor", or RCP. For those not well-versed in Nintendo hardware, the N64 has two CPUs: the primary 64-bit VR4300 CPU, and a secondary 32-bit coprocessor - the Reality Coprocessor. The VR4300 CPU in the N64 does not directly talk to even the RAM, let alone to any of the other devices like the cartridge or controllers or what have you. The only thing the N64's main CPU talks to is the RCP. These two processors talk to each other over a 32-bit "SysAD" bus and a 5-bit "SysCMD" bus: SysAD holds the address or data to read/write, and SysCMD holds whether to read or write the data and the size of the data to read/write. SysAD always works in terms of 32 bits, so every time you read, you're reading 32 bits, and every time you write, you're writing 32 bits.

So what if you want to work with something smaller than 32 bits? Say, a single byte? In case of reads, this is easy: even though SysAD will give the CPU four bytes, the CPU can trivially isolate the byte it actually wants and ignore the rest. Easy peasy lemon squeezy. In the case of writes, the CPU puts the byte on SysAD, and then in SysCMD it sets bits 1 and 0 to 0 (meaning an 8-bit write) and expects the device on the other side of SysAD/SysCMD to check that. Easy peasy lemon squeezy, right?

WRONG. See, when writing to RDRAM this does work as expected, because the RCP talks to RDRAM over the aptly-named "RAM Interface" (RI), and the RI is smart enough to notice "oh, I'm only supposed to write 8 bits, so I'll ignore the rest of the word and only write the 8 bits I'm supposed to write". However, for everything else the RCP talks to, its implementation of SysAD/SysCMD is "simplified", meaning too simple to understand that there's a size being passed in. The RCP gets the write request and says "Wow! Time to write this whole 32-bit word! ...what's this? The CPU only wants me to write one byte? That sign can't stop me because I can't read!". And so, not only does it write the one byte, but it also completely clobbers the other three bytes in that 32-bit word. Things just got difficult difficult lemon difficult.
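
To make that failure mode concrete, here's a minimal host-side sketch (runnable with zig test, no N64 required). WordOnlyDevice and storeByte are hypothetical names made up for illustration, and the model assumes the clobbered byte lanes end up as zeroes, which matches what I've observed but isn't something I've verified against the hardware documentation:

```zig
const std = @import("std");

/// Toy model of a non-RDRAM device behind the RCP: every store replaces the
/// whole 32-bit word, even when the CPU only meant to write one byte.
/// Assumption: the byte lanes that weren't addressed end up as zero.
const WordOnlyDevice = struct {
    words: [4]u32 = .{ 0, 0, 0, 0 },

    /// Store one byte at a (big-endian) byte address. The other three bytes
    /// of the containing word are clobbered.
    fn storeByte(self: *WordOnlyDevice, addr: usize, value: u8) void {
        const lane: u5 = @intCast((3 - (addr % 4)) * 8);
        self.words[addr / 4] = @as(u32, value) << lane;
    }
};

test "byte stores clobber their neighbours" {
    var dev = WordOnlyDevice{};
    // Write "Hell" one byte at a time, like a naive byte-wise copy would.
    for ("Hell", 0..) |c, i| dev.storeByte(i, c);
    // Only the last byte written to the word ('l', 0x6C) survives.
    try std.testing.expectEqual(@as(u32, 0x0000006C), dev.words[0]);
}
```

In this model, writing four bytes to one word leaves only the last of them behind, which is the sort of thing that turns a greeting into gibberish.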

What does this mean for Zig, though? Well, say we want to write to an array/slice/buffer/whatever of bytes that lives outside of RDRAM - say, a data buffer for writing text to an IS-Viewer-compatible emulator's debug output:

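const std = @import("std");
const isviewer_buffer: *[0x200]u8 = @ptrFromInt(0xb3ff0020);
const isviewer_writelen: *volatile u32 = @ptrFromInt(0xb3ff0014);
const message = "Hello, world!";
// Naive version: copies the message into the buffer one byte at a time.
std.mem.copyForwards(u8, isviewer_buffer, message);
// Writing the length tells IS-Viewer how many bytes to print.
isviewer_writelen.* = message.len;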

This will compile fine. It'll even run "fine", with no panics or CPU crashes. But what you'll find is that your emulator won't be printing "Hello, world!". If it prints anything at all, it'll print something like "l,!" or "H r" or some other nonsense. What gives? Well, let's look at what std.mem.copyForwards() does:

/// Copy all of source into dest at position 0.
/// dest.len must be >= source.len.
/// If the slices overlap, dest.ptr must be <= src.ptr.
pub fn copyForwards(comptime T: type, dest: []T, source: []const T) void {
    for (dest[0..source.len], source) |*d, s| d.* = s;
}

Well darn, it literally just copies each and every byte one by one. In doing so, only one out of every four bytes is going to actually make it into the buffer, with the other three getting clobbered (with zeroes, from what I've observed). This won't do; we need something that'll combine every four bytes into a single 32-bit word, and then write those words. Something like this:

fn writeBytes(dest: []align(4) u8, src: []const u8) void {
    const chunked_len = src.len / 4;
    const dest_chunked: []u32 = @as([*]u32, @ptrCast(dest))[0..chunked_len + 1];
    for (dest_chunked[0..chunked_len], 0..) |*d, i| {
        const s = bytesToWord(src[(i*4)..]);
        d.* = s;
    }
    const extra = bytesToWord(src[(chunked_len * 4)..]);
    dest_chunked[chunked_len] = extra;
}

fn bytesToWord(bytes: []const u8) u32 {
    var buf: [4]u8 = .{0,0,0,0};
    if (bytes.len > 0) buf[0] = bytes[0];
    if (bytes.len > 1) buf[1] = bytes[1];
    if (bytes.len > 2) buf[2] = bytes[2];
    if (bytes.len > 3) buf[3] = bytes[3];
    return std.mem.readInt(u32, &buf, .big);
}
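
As a sanity check on the packing, here's a small host-side sketch of what "Hello, world!" should look like once it's chopped into big-endian words. packWord is a hypothetical stand-in that follows the same rule as bytesToWord above (up to four bytes, zero-padded), restated so the example is self-contained:

```zig
const std = @import("std");

/// Pack up to four bytes into one big-endian 32-bit word, zero-padding
/// whatever is missing. Hypothetical helper, same rule as bytesToWord.
fn packWord(bytes: []const u8) u32 {
    var word: u32 = 0;
    for (0..4) |i| {
        const b: u8 = if (i < bytes.len) bytes[i] else 0;
        word = (word << 8) | b;
    }
    return word;
}

test "Hello, world! packs into four words" {
    const msg = "Hello, world!";
    try std.testing.expectEqual(@as(u32, 0x48656C6C), packWord(msg[0..])); // "Hell"
    try std.testing.expectEqual(@as(u32, 0x6F2C2077), packWord(msg[4..])); // "o, w"
    try std.testing.expectEqual(@as(u32, 0x6F726C64), packWord(msg[8..])); // "orld"
    try std.testing.expectEqual(@as(u32, 0x21000000), packWord(msg[12..])); // "!" plus padding
}
```

Thirteen bytes become three full words plus one padded word, so the destination needs room for sixteen bytes.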

Now if we write to our buffer:

const std = @import("std");
const isviewer_buffer: *align(4) [0x200]u8 = @ptrFromInt(0xb3ff0020);
const isviewer_writelen: *volatile u32 = @ptrFromInt(0xb3ff0014);
const message = "Hello, world!";
// Copy the message as whole 32-bit words instead of single bytes.
writeBytes(isviewer_buffer, message);
isviewer_writelen.* = message.len;

It'll work... almost. See, we've got another snag here: not only does the RCP write whole words to non-RDRAM addresses, but it does so asynchronously. This is arguably a good thing, since some of these other devices (namely: the "Peripheral Interface" (PI) and "Serial Interface" (SI)) can be pretty slow and there are surely better things we can be doing while the RCP deals with that slowness for us. The bad news, though, is that accessing, say, the PI while it's busy handling a previous operation will cause all sorts of hilarious shenanigans (read: data corruption) to ensue.

Thankfully, these non-RDRAM devices all have status registers that tell us whether they're busy and what they're busy with. Looking at our handy dandy memory map, we can see that both isviewer_buffer and isviewer_writelen above point to addresses in "KSEG1", which directly mirrors the physical memory map at offset 0xA0000000. Subtracting 0xA0000000 from 0xB3FF0020 gives us 0x13FF0020, which is within one of the address ranges mapped to the "Peripheral Interface" (PI). The PI's status register looks something like this:

const PI = struct {
    // ...
    const Status = packed struct(u32) {
        dma_busy: bool,
        io_busy: bool,
        dma_error: bool,
        dma_complete: bool,
        reserved: u28,
    };
    var status: *volatile Status = @ptrFromInt(0xa4600010);
    // ...
};
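
The KSEG1 arithmetic from a moment ago is easy to fumble by hand, so here's a minimal sketch of it as a helper. kseg1ToPhysical is a hypothetical name (it's not in my repo), and the sketch only assumes the standard MIPS KSEG1 window of 0xA0000000 through 0xBFFFFFFF:

```zig
const std = @import("std");

/// Start of KSEG1, the uncached and unmapped mirror of physical memory.
const kseg1_base: u32 = 0xA000_0000;
/// First address past the end of KSEG1.
const kseg1_end: u32 = 0xC000_0000;

/// Translate a KSEG1 virtual address to the physical address it mirrors.
/// Hypothetical helper for illustration.
fn kseg1ToPhysical(vaddr: u32) u32 {
    std.debug.assert(vaddr >= kseg1_base and vaddr < kseg1_end);
    return vaddr - kseg1_base;
}

test "IS-Viewer buffer and PI status live where we expect" {
    // The IS-Viewer buffer sits in cartridge space, which the PI handles.
    try std.testing.expectEqual(@as(u32, 0x13FF_0020), kseg1ToPhysical(0xB3FF_0020));
    // The PI status register itself.
    try std.testing.expectEqual(@as(u32, 0x0460_0010), kseg1ToPhysical(0xA460_0010));
}
```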

(This, by the way, reminds me of another minor snag I hit: Zig packed structs lay out their fields starting from the least-significant bit, while the vast majority of documentation out there on N64 bit layouts lists the most-significant bit first. Moral of the story: put your fields down, flip it and reverse it (ti esrever dna ti pilf nwod sdleif ruoy tup...).)
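
If you want to convince yourself of that field ordering without booting a ROM, a host-side test along these lines should do it. The struct below mirrors the Status layout above; the bit positions in the comments are what @bitCast reports, not something copied from N64 documentation:

```zig
const std = @import("std");

const Status = packed struct(u32) {
    dma_busy: bool, // bit 0 (least significant)
    io_busy: bool, // bit 1
    dma_error: bool, // bit 2
    dma_complete: bool, // bit 3
    reserved: u28, // bits 4..31
};

test "the first field of a packed struct is the lowest bit" {
    const s = Status{
        .dma_busy = false,
        .io_busy = true,
        .dma_error = false,
        .dma_complete = false,
        .reserved = 0,
    };
    // Only io_busy is set, and it is the second field, so we get bit 1.
    const raw: u32 = @bitCast(s);
    try std.testing.expectEqual(@as(u32, 0b0010), raw);

    // Going the other way: a raw value with only bit 0 set means dma_busy.
    const decoded: Status = @bitCast(@as(u32, 0b0001));
    try std.testing.expect(decoded.dma_busy);
    try std.testing.expect(!decoded.io_busy);
}
```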

Avoiding those aforementioned "hilarious shenanigans" is as simple as just waiting for both PI.status.dma_busy and PI.status.io_busy to be false:

const PI = struct {
    // ...
    fn wait() void {
        while (status.io_busy or status.dma_busy) {}
    }
};

And now we can fix our writeBytes():

fn writeBytes(dest: []align(4) u8, src: []const u8) void {
    const chunked_len = src.len / 4;
    const dest_chunked: []u32 = @as([*]u32, @ptrCast(dest))[0..chunked_len + 1];
    for (dest_chunked[0..chunked_len], 0..) |*d, i| {
        const s = bytesToWord(src[(i*4)..]);
        PI.wait();
        d.* = s;
    }
    const extra = bytesToWord(src[(chunked_len * 4)..]);
    PI.wait();
    dest_chunked[chunked_len] = extra;
}

fn bytesToWord(bytes: []const u8) u32 {
    var buf: [4]u8 = .{0,0,0,0};
    if (bytes.len > 0) buf[0] = bytes[0];
    if (bytes.len > 1) buf[1] = bytes[1];
    if (bytes.len > 2) buf[2] = bytes[2];
    if (bytes.len > 3) buf[3] = bytes[3];
    return std.mem.readInt(u32, &buf, .big);
}

And our logging code:

const std = @import("std");
const isviewer_buffer: *align(4) [0x200]u8 = @ptrFromInt(0xb3ff0020);
const isviewer_writelen: *volatile u32 = @ptrFromInt(0xb3ff0014);
const message = "Hello, world!";
writeBytes(isviewer_buffer, message);
// Make sure the last buffer write has landed before kicking off the print.
PI.wait();
isviewer_writelen.* = message.len;

Now this all raises an important question: "how in the heck did you troubleshoot this?" Notice that I hit these snags in the process of getting even so much as basic debug logging implemented, let alone the sorts of infrastructure necessary to point GDB at the running ROM via something like UNFLoader.

The good news is that the Ares emulator has some extensive debugging options, including the ability to log each and every instruction the (emulated) CPU executes. The other good news is that Godbolt exists, and it's easy enough to shove -target mips-freestanding-none -mcpu mips3 into the compiler flags and inspect the generated assembly.

The bad news is that this is a very tedious process. Also, the assembly output from Godbolt doesn't quite match the instruction output from Ares:

Needless to say, following the execution flow in Godbolt wasn't exactly trivial, but eventually I got used to those quirks and was able to quickly figure out "okay, this particular instruction from the CPU has some unusual-looking arguments, I bet if I Ctrl-F for it in the assembly output I can narrow down this code to the right function". For example, based on output like this:

CPU  ffffffff80001310  subiu   sp,sp{$807e9790},$20
CPU  ffffffff80001314  sw      ra{$80000d6c},sp+$1c{$807e978c}
CPU  ffffffff80001318  sw      s8{$807e9790},sp+$18{$807e9788}
CPU  ffffffff8000131c  or      s8,sp{$807e9770},0
CPU  ffffffff80001320  sw      a0{$00000008},s8+$0{$807e9770}
CPU  ffffffff80001324  sw      a1{$00000008},s8+$4{$807e9774}
CPU  ffffffff80001328  sw      ra{$80000d6c},s8+$8{$807e9778}
CPU  ffffffff8000132c  liu     at,$00000001
CPU  ffffffff80001330  sb      at{$00000001},s8+$c{$807e977c}
CPU  ffffffff80001334  sw      a0{$00000008},s8+$10{$807e9780}
CPU  ffffffff80001338  sw      a1{$00000008},s8+$14{$807e9784}
CPU  ffffffff8000133c  liu     a0,$00000000
CPU  ffffffff80001340  addiu   a1,s8{$807e9770},$8
CPU  ffffffff80001344  addiu   a2,s8{$807e9770},$10
CPU  ffffffff80001348  jal     $800017f8
CPU  ffffffff8000134c  nop

I can identify, say, that sw a1, s8+$14, convert it to sw $5, 20($fp), and then see if the surrounding instructions in the assembly match the surrounding instructions in the logs. In this case, there happened to be two functions that matched (because they're identical except for the target of that last jal):

builtin.panicStartGreaterThanEnd:
        addiu   $sp, $sp, -32
        sw      $ra, 28($sp)
        sw      $fp, 24($sp)
        move    $fp, $sp
        sw      $4, 0($fp)
        sw      $5, 4($fp)
        sw      $ra, 8($fp)
        addiu   $1, $zero, 1
        sb      $1, 12($fp)
        sw      $4, 16($fp)
        sw      $5, 20($fp)
        addiu   $4, $zero, 0
        addiu   $5, $fp, 8
        addiu   $6, $fp, 16
        jal     debug.panicExtra__anon_1135
        nop

builtin.panicOutOfBounds:
        addiu   $sp, $sp, -32
        sw      $ra, 28($sp)
        sw      $fp, 24($sp)
        move    $fp, $sp
        sw      $4, 0($fp)
        sw      $5, 4($fp)
        sw      $ra, 8($fp)
        addiu   $1, $zero, 1
        sb      $1, 12($fp)
        sw      $4, 16($fp)
        sw      $5, 20($fp)
        addiu   $4, $zero, 0
        addiu   $5, $fp, 8
        addiu   $6, $fp, 16
        jal     debug.panicExtra__anon_1136
        nop

Since these are the only two matches, I know with pretty darn good certainty that the Zig code running on the CPU panicked, most likely while indexing into some array or slice (and indeed, in this case it was when implementing that above-described "write bytes four at a time" function, due to an off-by-one error on my part when reading the last chunk of bytes from the source buffer).
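
The mental translation between the two notations (a1 to $5, $14 to 20, and so on) is mechanical enough to sketch in code. This is only an illustration of the mapping, not a tool from my repo: registerNumber and traceImmediate are hypothetical names, and the table assumes the usual MIPS o32 register names. Note that the LLVM output spells a few registers by name ($sp, $fp, $ra, $zero) rather than by number:

```zig
const std = @import("std");

/// MIPS o32 ABI register names in numeric order ($0 through $31).
const abi_names = [32][]const u8{
    "zero", "at", "v0", "v1", "a0", "a1", "a2", "a3",
    "t0",   "t1", "t2", "t3", "t4", "t5", "t6", "t7",
    "s0",   "s1", "s2", "s3", "s4", "s5", "s6", "s7",
    "t8",   "t9", "k0", "k1", "gp", "sp", "s8", "ra",
};

/// Map an ABI register name from the trace to its number (a1 -> 5).
fn registerNumber(name: []const u8) ?u5 {
    for (abi_names, 0..) |candidate, i| {
        if (std.mem.eql(u8, candidate, name)) {
            const n: u5 = @intCast(i);
            return n;
        }
    }
    return null;
}

/// Parse a trace-style hex immediate such as "$14" into an integer (20).
fn traceImmediate(text: []const u8) !u32 {
    if (text.len < 2 or text[0] != '$') return error.InvalidCharacter;
    return std.fmt.parseInt(u32, text[1..], 16);
}

test "sw a1, s8+$14 is sw $5, 20($fp)" {
    try std.testing.expectEqual(@as(?u5, 5), registerNumber("a1"));
    try std.testing.expectEqual(@as(?u5, 30), registerNumber("s8")); // $fp
    try std.testing.expectEqual(@as(u32, 20), try traceImmediate("$14"));
}
```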

Next steps

I'd like this to evolve into an actual package others can use for their own Zig-on-N64 projects. Some code cleanup and reorganization is in order to make that happen. Given the above-mentioned snags around RCP-managed memory writes, I'll probably convert some of those []u8s to []u32s if I don't expect to need to read individual bytes from them.

Obviously I'd like to get something on the screen. Even if it's just setting the background color or something, it's at least a start. That'll give me an opportunity to implement some code around the RDP ("Reality Display Processor") and the VI ("Video Interface"). That'll put me on the right footing to start drawing rectangles and triangles. I don't think I'll try writing my own RSP microcode yet (especially since I have no idea how I'd go about compiling it from within Zig), but it'd be neat to be able to use existing third-party microcodes like Tiny3D.

Now that I know how to get some kind of text output going, I'd like to make a second pass at implementing the IPL3 loader in Zig. That'll eliminate the dependency on libdragon's IPL3 and n64tool, making this a pure-Zig codebase. The tricky part is that part of the IPL3's job is to initialize RDRAM, meaning that the only RAM I'd have access to is the RAM within the RCP itself: two 4096-byte blocks called IMEM and DMEM (normally used by the "Reality Signal Processor" (RSP), which is the other MIPS CPU in every Nintendo 64). The saving grace is that the CPU can execute instructions directly from the cartridge (albeit slowly), so it's possible to offload all but the bare essentials of the bootloader onto the cartridge, and then only copy them over once RDRAM is up and running... but the flip side of that is that this happens over the PI bus, which means that I can't execute code from the cartridge and run PI I/O at the same time. libdragon's IPL3 has some tricks up its sleeve to manage all this, and it was those tricks that had me tangled up the first time I tried going down that path; now that I've got my bearings, it'll hopefully be more feasible.

Speaking of that text output, currently my code only outputs via SummerCart 64 or IS-Viewer. I also have an Everdrive 64 v7 on my desk, so I'd like to get USB support implemented for that too, which would round out the three options that seem to be most common for homebrew development. libdragon also supports the 64Drive and the iQue for debug logging, and I can probably do the same, but I have neither of those so I wouldn't be able to test them. At some point getting input would be nice, too.

Once I've overcome those milestones, I should be squared away to start tackling what's needed to actually make games: flesh out the PI/SI interface code, handle controller inputs, get audio playback working, etc. That's all probably a ways out, though. This baby's only taken its very first steps, after all! Can't jump straight from first steps to running marathons :)

