DEMO

what we build, and why

01

A chip is a giant logic puzzle

A digital chip is a network of tiny switches (transistors) wired into AND / OR / NOT gates. Those gates are wired into adders, registers, decoders, caches — and ultimately into a complete machine like a CPU.

You can think of a chip as a spreadsheet with millions of cells. Each cell holds a 0 or a 1. Some cells depend on others through simple rules. Every “clock tick”, every cell recomputes its value based on its neighbours, in lockstep.

[diagram: spreadsheet cells recomputing → next cycle, in lockstep]
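That lockstep update can be sketched in a few lines of Python (a toy model of the analogy, not arc's internal representation): every cell's next value is computed from a frozen snapshot of the current values, then all cells commit at once.

```python
# Toy "chip as spreadsheet": every cell recomputes from the *current*
# snapshot of its neighbours, then all cells update in lockstep.
def tick(cells, rules):
    snapshot = dict(cells)                  # freeze current values
    return {name: rule(snapshot) for name, rule in rules.items()}

# A 3-cell example: b follows a, c inverts b (each one tick late).
rules = {
    "a": lambda s: s["a"],        # constant input
    "b": lambda s: s["a"],        # copies a, one tick late
    "c": lambda s: 1 - s["b"],    # inverts b, one tick late
}
cells = {"a": 1, "b": 0, "c": 0}
cells = tick(cells, rules)   # after tick 1: a=1, b=1, c=1
cells = tick(cells, rules)   # after tick 2: a=1, b=1, c=0
```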
02

Why simulate?

$10M–$100M: cost of one fabrication run
6 months: turnaround per silicon revision
1 shot: to get it right before fab

Before fabrication, designers verify the chip exhaustively in software. They describe the chip in Verilog (or SystemVerilog) — and a simulator runs it for billions of clock ticks against test cases, checking outputs against expected behaviour.

A bug that surfaces after 10 billion ticks is 28 hours at 100 kHz, or 17 minutes at 10 MHz.
Faster simulator → faster iteration → ship hardware features faster.
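The arithmetic behind those figures, spelled out:

```python
# How long until a bug at tick 10 billion surfaces, at two simulator speeds?
ticks = 10_000_000_000
hours_at_100khz = ticks / 100e3 / 3600    # ~27.8 hours at 100 kHz
minutes_at_10mhz = ticks / 10e6 / 60      # ~16.7 minutes at 10 MHz
```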

03

The state of the art

The fastest simulator is Verilator (open-source). It translates Verilog into C++, compiles it, and runs it on CPU. On a vexriscv RISC-V CPU (~44,000 gates) it runs at ~1.25 MHz on a Mac. It's missing important features of commercial simulators, particularly for verification, but we'll use it as a benchmark for speed.

Commercial tools (Synopsys VCS, Cadence Xcelium — ~$500K/seat/year) are fully featured, but slower.

Verilator’s the bar for speed.

04

Where our simulator (arc) is different

Two ideas.

i

Verification is parallel, not serial

A CPU-based simulator runs one test stream at a time. A real verification campaign is thousands of tests: random programs, fuzzed inputs, property checks, regression suites.

This is what GPUs are built for: doing the same computation on many inputs at once. arc's GPU and threaded modes run hundreds of test streams in parallel, each evaluating the same chip with different stimulus. This is multistim mode.

[diagram: CPU (Verilator-style): one stim → one chip → one result, on one core. GPU / threaded (multistim): N stims → N chips → N results, on parallel hardware]
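The idea in miniature, as Python (a toy sketch of the multistim concept, not arc's API): one next-state function, N independent stimulus streams, each carrying its own copy of the chip state.

```python
# Multistim in miniature: one chip model, N independent stimulus streams.
# Every cycle applies the same next-state function to each stream's state.
def step(state, stim):
    # toy "chip": an 8-bit accumulator register fed by the stimulus
    return {"acc": (state["acc"] + stim) & 0xFF}

def run_multistim(streams, cycles):
    states = [{"acc": 0} for _ in streams]          # one state per stream
    for t in range(cycles):
        states = [step(s, stim[t]) for s, stim in zip(states, streams)]
    return states

streams = [[1, 1, 1], [2, 2, 2], [10, 20, 30]]      # 3 test streams
final = run_multistim(streams, 3)
# accumulated per stream: 3, 6, 60
```

On a GPU the inner list comprehension becomes N threads running the same kernel; here it is just a loop over independent states.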
ii

Normalise to AIG so the GPU mapping works

To run on a GPU, the chip’s per-cycle logic has to fit into limited GPU register memory (~64 registers per thread). RTL is too big and too irregular. arc reduces it to an and-inverter graph — every signal becomes either AND(a, b) or NOT(a).

[diagram: RTL mixes primitives (`if (op == ADD) r = a + b; else r = a & ~b;` uses +, &, ~, if); flattening yields an AIG in which every node is AND or NOT]

Every signal is the same primitive now. The graph can be partitioned cleanly (uniform-cost nodes), vectorised, and mapped onto GPU thread blocks. The AIG is the bridge to the GPU.
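A minimal AIG sketch in Python (illustrative, not arc's data structure): XOR built from nothing but AND and NOT nodes, then evaluated over its truth table.

```python
# And-inverter graph in miniature: every node is AND(a, b) or NOT(a).
# Any boolean function can be lowered to these two primitives.
def AND(a, b): return ("and", a, b)
def NOT(a):    return ("not", a)

def OR(a, b):  return NOT(AND(NOT(a), NOT(b)))            # De Morgan
def XOR(a, b): return OR(AND(a, NOT(b)), AND(NOT(a), b))

def evaluate(node, inputs):
    if isinstance(node, str):            # leaf: a named input signal
        return inputs[node]
    if node[0] == "not":
        return 1 - evaluate(node[1], inputs)
    return evaluate(node[1], inputs) & evaluate(node[2], inputs)

graph = XOR("a", "b")                    # a ^ b, using only AND/NOT nodes
table = [evaluate(graph, {"a": a, "b": b}) for a in (0, 1) for b in (0, 1)]
# table == [0, 1, 1, 0]
```

Because every interior node is the same two-input primitive, partitioning the graph into equal-cost chunks is straightforward.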

05

What happens when you run arc

[diagram: SystemVerilog → parse + elaborate (slang) → SynthIR (structure preserved). CPU path: synth-AOT codegen → gcc -O2 → .so → dlopen + run loop; ~1 s cold-start. GPU path: flatten to AIG → partition (KaHyPar) into ~256-gate chunks → CUDA kernel dispatch; ~30 s cold-start, then very fast per cycle.]

The frontend: slang

The first job is to read the SystemVerilog. This is harder than it sounds. SystemVerilog has a preprocessor (macros, includes, conditional compilation), generate blocks (compile-time loops), parameterised modules (the same code instantiated with different bit widths), packages, classes, and a hierarchy of modules instantiating other modules.

arc uses slang — the only open-source parser that handles the full language. slang does two things:

  • Parse — text → abstract syntax tree.
  • Elaborate — resolve every module instantiation, expand every generate block, substitute every parameter value, fold every constant. The result is a flat tree of concrete modules with no remaining abstraction.

Out the other side comes a fully-elaborated design with every signal width and every constant fixed. That’s the starting point for everything arc does.
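What elaboration buys you can be sketched with a toy constant folder in Python (illustrative only; slang's IR and folding are far richer): substitute a parameter value into an expression tree, then fold the now-constant arithmetic.

```python
# Toy elaboration: substitute parameter values into an expression tree,
# then fold constants. Expressions are (op, lhs, rhs) tuples or leaves.
def substitute(expr, params):
    if isinstance(expr, str):
        return params.get(expr, expr)    # replace named parameters
    if isinstance(expr, int):
        return expr
    op, lhs, rhs = expr
    return (op, substitute(lhs, params), substitute(rhs, params))

def fold(expr):
    if not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr
    lhs, rhs = fold(lhs), fold(rhs)
    if isinstance(lhs, int) and isinstance(rhs, int):
        return {"+": lhs + rhs, "*": lhs * rhs}[op]
    return (op, lhs, rhs)

# e.g. a module parameterised by WIDTH, instantiated with WIDTH = 32;
# some derived constant WIDTH * 2 + 1 becomes a fixed number:
expr = ("+", ("*", "WIDTH", 2), 1)
elaborated = fold(substitute(expr, {"WIDTH": 32}))   # 65
```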

Inside one cycle

Whichever path arc takes, the per-cycle work is the same:

  1. Read inputs (from the testbench).
  2. Propagate combinational logic in topological order.
  3. Snapshot register inputs at the clock edge.
  4. Commit registers (D → Q).
  5. Re-propagate logic affected by the new register values.
  6. Emit outputs.

The CPU does this in one compiled function. The GPU does it in thousands of coordinated thread blocks.
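The six steps, as a toy Python cycle function (a sketch of the schedule, not arc's generated code) for a design with one register q and combinational logic d = (q + inp) mod 16:

```python
# One simulated cycle of a toy design, following the six steps above.
def cycle(q, inp):
    # 1. read inputs (inp comes from the testbench)
    d = (q + inp) % 16         # 2. propagate combinational logic
    snapshot = d               # 3. snapshot register input at the clock edge
    q = snapshot               # 4. commit register (D -> Q)
    d = (q + inp) % 16         # 5. re-propagate logic from the new Q
    return q, d                # 6. emit outputs

q = 0
for inp in [3, 3, 3]:
    q, out = cycle(q, inp)
# q has accumulated 3 three times: q == 9, and out looks one add ahead: 12
```

The snapshot/commit split is what keeps all registers updating in lockstep: no register sees another register's new value until after the edge.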

07

Results

These results are generated by running arc on open-source RISC-V CPUs (vexriscv at 44K gates, vexriscv_min at 12K).

ABOUT THE DESIGN UNDER TEST

VexRiscv is a popular open-source 32-bit RISC-V CPU. It's widely used in FPGA and ASIC projects, and ships in two main configurations: vexriscv_min, a minimal pipeline at roughly 12,000 gates, and the full vexriscv with caches, branch prediction, and a multiplier, at roughly 44,000 gates. Both are real, production-quality cores, and both run on arc.

Single-stim, CPU mode
one test stream, one CPU thread
Design                     Verilator   arc        Ratio
vexriscv_min (12K gates)   1.1 MHz     3.5 MHz    arc 3.2×
vexriscv (44K gates)       1.25 MHz    0.67 MHz   Verilator 1.9×

arc beats Verilator on smaller designs; on the largest design, Verilator's decade of optimisation work still wins. We're working to close that gap, then exceed it.

Multistim, CPU mode
vexriscv_min at N parallel test streams on the host CPU via rayon
N stimuli   arc aggregate   Verilator (1 thread)   vs 1 thread
N = 1       3.54 MHz        1.62 MHz               2.2×
N = 2       6.77 MHz        1.63 MHz               4.1×
N = 4       13.30 MHz       1.61 MHz               8.2×

N=1 is the single-stim baseline; N>1 is rayon-threaded multistim. The per-thread efficiency is high — N=4 lands at ~3.8× the N=1 aggregate (95% scaling).

Multistim, GPU mode (H100)
full vexriscv at N = 1024 stimuli on a Modal H100, vs single-threaded and multi-threaded Verilator
N stimuli   arc aggregate   Verilator (1 thread)   Verilator (multi-threaded)   vs 1 thread   vs multi-threaded
N = 1024    3.85 MHz        290 kHz                1.27 MHz                     13.3×         3.04×

For a chip team running a 1000-stimulus regression overnight, this is the difference between an hours-long wait and a coffee break.