z386: An Open-Source 80386 Built Around Original Microcode

This is the fifth installment of the 80386 series. The FPGA CPU is now far enough along to run real software, and this post is about how it works. z386 is a 386-class CPU built around the original Intel microcode, in the same spirit as z8086.

The core is not an instruction-by-instruction emulator in RTL. The goal is to recreate enough of the original machine that the recovered 386 control ROM can drive it. Today z386 boots DOS 6 and DOS 7, runs protected-mode programs like DOS/4GW and DOS/32A, and plays games like Doom and Cannon Fodder. Here are some rough numbers against ao486:

Metric z386 ao486
Lines of code (cloc) 8K 17.6K
ALUTs 18K 21K
Registers 5K 6.5K
BRAM 116K 131K
FPGA clock 85MHz 90MHz
3DBench FPS 34 43
Doom (original) FPS, max details 16.5 21.0

In current builds, z386 performs like a fast (~70MHz) cached 386-class machine, or a low-end 486. It runs at a much higher clock than historical 386 CPUs, but with somewhat worse CPI (cycles per instruction). The current cache is a 16 KB, 4-way set-associative unified L1, chosen partly to keep the clock high. Real high-end 386 systems often used larger external caches, typically in the 32 KB to 128 KB range.

Doom II running on z386
Doom II running on z386.

Much of this 386 microarchitecture archaeology has already been covered in the previous four posts: the multiplication/division datapath, the barrel shifter, protection and paging, and the memory pipeline. z386 tries to be both an educational reconstruction and a usable FPGA CPU. It keeps many 386-like structures: a 32-entry paging TLB, a barrel shifter shaped like the original, ROM/PLA-style decoding, the Protection PLA model, and most importantly the 37-bit-wide, 2,560-entry microcode ROM. At the same time, it uses FPGA-friendly shortcuts where they make sense, such as DSP blocks for multiplication and the small fast L1 cache.

In this post, I will fill in the rest of the design: instruction prefetch, decode, the microcode sequencer, cache design, testing, how z386 differs from ao486, and some lessons from the bring-up.

Read more →

80386 Memory Pipeline

The FPGA 386 core I've been building now boots DOS, runs applications like Norton Commander, and plays games like Doom. On DE10-Nano it currently runs at 75 MHz. With the core now far enough along to run real software, this seems like a good point to step back and look at one of the 80386's performance-critical subsystems: its memory pipeline.

32-bit Protected Mode was the defining feature of the 80386. In the previous post, I looked at one side of that story: the virtual-memory protection mechanisms. We saw how the 80386 implements protection with a dedicated PLA, segment caches, and a hardware page walker. This time I want to look at virtual memory from a different angle: the microarchitecture of the memory access pipeline, how address translation is made efficient, how microcode drives the process, and what kind of RTL timing the design achieves.

On paper, x86 virtual memory management looks expensive. Every memory reference seems to require effective address calculation, segment relocation, limit checking, TLB lookup, and, on a miss, two page-table reads plus Accessed/Dirty-bit updates. Yet Intel's own 1986 IEEE ICCD paper, Jim Slager's Performance Optimizations of the 80386, describes the common-case address path as completing in about 1.5 clocks. How did the 386 pull that off?

The answer is that virtual memory is not really a serial chain of checks, even if the diagrams make it look that way. It is a carefully overlapped memory pipeline that uses pre-calculation, pipelining, and parallelism to keep the common case surprisingly short.

Read more →

80386 Protection

I'm building an 80386-compatible core in SystemVerilog and blogging the process. In the previous post, we looked at how the 386 reuses one barrel shifter for all shift and rotate instructions. This time we move from real mode to protected and talk about protection.

The 80286 introduced "Protected Mode" in 1982. It was not popular. The mode was difficult to use, lacked paging, and offered no way to return to real mode without a hardware reset. The 80386, arriving three years later, made protection usable -- adding paging, a flat 32-bit address space, per-page User/Supervisor control, and Virtual 8086 mode so that DOS programs could run inside a protected multitasking system. These features made possible Windows 3.0, OS/2, and early Linux.

The x86 protection model is notoriously complex, with four privilege rings, segmentation, paging, call gates, task switches, and virtual 8086 mode. What's interesting from a hardware perspective is how the 386 manages this complexity on a 275,000-transistor budget. The 386 employs a variety of techniques to implement protection: a dedicated PLA for protection checking, a hardware state machine for page table walks, segment and paging caches, and microcode for everything else.

Read more →

80386 Barrel Shifter

I’m currently building an 80386-compatible core in SystemVerilog, driven by the original Intel microcode extracted from real 386 silicon. Real mode is now operational in simulation, with more than 10,000 single-instruction test cases passing successfully, and work on protected-mode features is in progress. In the course of this work, corners of the 386 microcode and silicon have been examined in detail; this series documents the resulting findings.

In the previous post, we looked at multiplication and division -- iterative algorithms that process one bit per cycle. Shifts and rotates are a different story: the 386 has a dedicated barrel shifter that completes an arbitrary multi-bit shift in a single cycle. What's interesting is how the microcode makes one piece of hardware serve all shift and rotate variants -- and how the complex rotate-through-carry instructions are handled.

Read more →

80386 Multiplication and Division

When Intel released the 80386 in October 1985, it marked a watershed moment for personal computing. The 386 was the first 32-bit x86 processor, increasing the register width from 16 to 32 bits and vastly expanding the address space compared to its predecessors. This wasn't just an incremental upgrade—it was the foundation that would carry the PC architecture for decades to come.

Read more →

New Site Design

Happy New Year! The site has a fresh new design. I’ve replaced Hugo with a minimal, custom static site generator that suits my needs much better.

Read more →

z8086: Rebuilding the 8086 from Original Microcode

After 486Tang, I wanted to go back to where x86 started. The result is z8086: a 8086/8088 core that runs the original Intel microcode. Instead of hand‑coding hundreds of instructions, the core loads the recovered 512x21 ROM and recreates the micro‑architecture the ROM expects.

z8086 is compact and FPGA‑friendly: it runs on a single clock domain, avoids vendor-specific primitives, and offers a simple external bus interface. Version 0.1 is about 2000 lines of SystemVerilog, and on a Gowin GW5A device, it uses around 2500 LUTs with a maximum clock speed of 60 MHz. The core passes all ISA test vectors, boots small programs, and can directly control peripherals like an SPI display. While it doesn’t boot DOS yet, it’s getting close.

Read more →

8086 Microcode Browser

Since releasing 486Tang, I’ve been working on recreating the 8086 with a design that stays as faithful as possible to the original chip. That exploration naturally led me deep into the original 8086 microcode — extracted and disassembled by Andrew Jenner in 2020.

Like all microcoded CPUs, the 8086 hides a lot of subtle behavior below the assembly layer. While studying it I kept extensive notes, and those eventually evolved into something more useful: an interactive browser for the entire 8086 microcode ROM.

Read more →

486Tang - 486 on a credit-card-sized FPGA board

Yesterday I released 486Tang v0.1 on GitHub. It’s a port of the ao486 MiSTer PC core to the Sipeed Tang Console 138K FPGA. I’ve been trying to get an x86 core running on the Tang for a while. As far as I know, this is the first time ao486 has been ported to a non-Altera FPGA. Here’s a short write‑up of the project.

Read more →

MCU for Better FPGA Gaming on Tang Console

A year ago, I added a softcore CPU to SNESTang, to make FPGA gaming cores easier to use. Over the past months, this allowed me to implement features like an improved menu system and core switching. While the softcore served its purpose, its limitations—slow performance, inability to handle complex peripherals like USB, and FPGA resource consumption—became apparent. Now is again the time to introduce some changes. After extensive collaboration with the Sipeed team, we've finally found a way to tap the Tang boards' onboard MCU (a Bouffalo BL616 chip) to address these challenges. The result is TangCore 0.6, along with all four gaming cores. In this post, I'll discuss integrating the MCU with the Tang gaming cores.

Read more →