DEMO

what we build, and why

01

A chip is a giant logic puzzle

A digital chip is a network of tiny switches (transistors) wired into AND / OR / NOT gates. Those gates are wired into adders, registers, decoders, caches — and ultimately into a complete machine like a CPU.

You can think of a chip as a spreadsheet with millions of cells. Each cell holds a 0 or a 1. Some cells depend on others through simple rules. Every “clock tick”, every cell recomputes its value based on its neighbours, in lockstep.

[diagram: spreadsheet cells recomputing → next cycle, in lockstep]
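That lockstep update can be sketched in a few lines of Python (a toy model of the analogy, not arc's internal representation): every cell's next value is computed from a frozen snapshot of the current values, then all cells commit at once.

```python
# Toy "chip as spreadsheet": every cell recomputes from the *current*
# snapshot of its neighbours, then all cells update in lockstep.
def tick(cells, rules):
    snapshot = dict(cells)                  # freeze current values
    return {name: rule(snapshot) for name, rule in rules.items()}

# A 3-cell example: b follows a, c inverts b (each one tick late).
rules = {
    "a": lambda s: s["a"],        # constant input
    "b": lambda s: s["a"],        # copies a, one tick late
    "c": lambda s: 1 - s["b"],    # inverts b, one tick late
}
cells = {"a": 1, "b": 0, "c": 0}
cells = tick(cells, rules)   # after tick 1: a=1, b=1, c=1
cells = tick(cells, rules)   # after tick 2: a=1, b=1, c=0
```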
02

Why simulate?

$10M–$100M: cost of one fabrication run
6 months: turnaround per silicon revision
1 shot: to get it right before fab

Before fabrication, designers verify the chip exhaustively in software. They describe the chip in Verilog (or SystemVerilog) — and a simulator runs it for billions of clock ticks against test cases, checking outputs against expected behaviour.

A bug that surfaces after 10 billion ticks is 28 hours at 100 kHz, or 17 minutes at 10 MHz.
Faster simulator → faster iteration → ship hardware features faster.
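The arithmetic behind those figures, spelled out:

```python
# How long until a bug at tick 10 billion surfaces, at two simulator speeds?
ticks = 10_000_000_000
hours_at_100khz = ticks / 100e3 / 3600    # ~27.8 hours at 100 kHz
minutes_at_10mhz = ticks / 10e6 / 60      # ~16.7 minutes at 10 MHz
```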

03

The state of the art

The fastest simulator is Verilator (open-source). It translates Verilog into C++, compiles it, and runs it on CPU. On a vexriscv RISC-V CPU (~44,000 gates) it runs at ~1.25 MHz on a Mac. It's missing important features of commercial simulators, particularly for verification, but we'll use it as a benchmark for speed.

Commercial tools (Synopsys VCS, Cadence Xcelium — ~$500K/seat/year) are fully featured, but slower.

Verilator’s the bar for speed.

04

Where our simulator (arc) is different

Two ideas.

i

Verification is parallel, not serial

A CPU-based simulator runs one test stream at a time. A real verification campaign is thousands of tests: random programs, fuzzed inputs, property checks, regression suites.

This is what GPUs are built for: doing the same computation on many inputs at once. arc's GPU and threaded modes run hundreds of test streams in parallel, each evaluating the same chip with different stimulus. This is multistim mode.

[diagram: CPU (Verilator-style): one stim → one chip → one result, on one core. GPU / threaded (multistim): N stims → N chips → N results, on parallel hardware]
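The idea in miniature, as Python (a toy sketch of the multistim concept, not arc's API): one next-state function, N independent stimulus streams, each carrying its own copy of the chip state.

```python
# Multistim in miniature: one chip model, N independent stimulus streams.
# Every cycle applies the same next-state function to each stream's state.
def step(state, stim):
    # toy "chip": an 8-bit accumulator register fed by the stimulus
    return {"acc": (state["acc"] + stim) & 0xFF}

def run_multistim(streams, cycles):
    states = [{"acc": 0} for _ in streams]          # one state per stream
    for t in range(cycles):
        states = [step(s, stim[t]) for s, stim in zip(states, streams)]
    return states

streams = [[1, 1, 1], [2, 2, 2], [10, 20, 30]]      # 3 test streams
final = run_multistim(streams, 3)
# accumulated per stream: 3, 6, 60
```

On a GPU the inner list comprehension becomes N threads running the same kernel; here it is just a loop over independent states.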
ii

Normalise to AIG so the GPU mapping works

To run on a GPU, the chip’s per-cycle logic has to fit into limited GPU register memory (~64 registers per thread). RTL is too big and too irregular. arc reduces it to an and-inverter graph — every signal becomes either AND(a, b) or NOT(a).

[diagram: RTL mixes primitives (`if (op == ADD) r = a + b; else r = a & ~b;` uses +, &, ~, if); flattening yields an AIG in which every node is AND or NOT]

Every signal is the same primitive now. The graph can be partitioned cleanly (uniform-cost nodes), vectorised, and mapped onto GPU thread blocks. The AIG is the bridge to the GPU.
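A minimal AIG sketch in Python (illustrative, not arc's data structure): XOR built from nothing but AND and NOT nodes, then evaluated over its truth table.

```python
# And-inverter graph in miniature: every node is AND(a, b) or NOT(a).
# Any boolean function can be lowered to these two primitives.
def AND(a, b): return ("and", a, b)
def NOT(a):    return ("not", a)

def OR(a, b):  return NOT(AND(NOT(a), NOT(b)))            # De Morgan
def XOR(a, b): return OR(AND(a, NOT(b)), AND(NOT(a), b))

def evaluate(node, inputs):
    if isinstance(node, str):            # leaf: a named input signal
        return inputs[node]
    if node[0] == "not":
        return 1 - evaluate(node[1], inputs)
    return evaluate(node[1], inputs) & evaluate(node[2], inputs)

graph = XOR("a", "b")                    # a ^ b, using only AND/NOT nodes
table = [evaluate(graph, {"a": a, "b": b}) for a in (0, 1) for b in (0, 1)]
# table == [0, 1, 1, 0]
```

Because every interior node is the same two-input primitive, partitioning the graph into equal-cost chunks is straightforward.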

05

What happens when you run arc

[diagram: SystemVerilog → parse + elaborate (slang) → SynthIR (structure preserved). CPU path: synth-AOT codegen → gcc -O2 → .so → dlopen + run loop; ~1 s cold-start. GPU path: flatten to AIG → partition (KaHyPar) into ~256-gate chunks → CUDA kernel dispatch; ~30 s cold-start, then very fast per cycle.]

The frontend: slang

The first job is to read the SystemVerilog. This is harder than it sounds. SystemVerilog has a preprocessor (macros, includes, conditional compilation), generate blocks (compile-time loops), parameterised modules (the same code instantiated with different bit widths), packages, classes, and a hierarchy of modules instantiating other modules.

arc uses slang — the only open-source parser that handles the full language. slang does two things:

  • Parse — text → abstract syntax tree.
  • Elaborate — resolve every module instantiation, expand every generate block, substitute every parameter value, fold every constant. The result is a flat tree of concrete modules with no remaining abstraction.

Out the other side comes a fully-elaborated design with every signal width and every constant fixed. That’s the starting point for everything arc does.
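What elaboration buys you can be sketched with a toy constant folder in Python (illustrative only; slang's IR and folding are far richer): substitute a parameter value into an expression tree, then fold the now-constant arithmetic.

```python
# Toy elaboration: substitute parameter values into an expression tree,
# then fold constants. Expressions are (op, lhs, rhs) tuples or leaves.
def substitute(expr, params):
    if isinstance(expr, str):
        return params.get(expr, expr)    # replace named parameters
    if isinstance(expr, int):
        return expr
    op, lhs, rhs = expr
    return (op, substitute(lhs, params), substitute(rhs, params))

def fold(expr):
    if not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr
    lhs, rhs = fold(lhs), fold(rhs)
    if isinstance(lhs, int) and isinstance(rhs, int):
        return {"+": lhs + rhs, "*": lhs * rhs}[op]
    return (op, lhs, rhs)

# e.g. a module parameterised by WIDTH, instantiated with WIDTH = 32;
# some derived constant WIDTH * 2 + 1 becomes a fixed number:
expr = ("+", ("*", "WIDTH", 2), 1)
elaborated = fold(substitute(expr, {"WIDTH": 32}))   # 65
```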

Inside one cycle

Whichever path arc takes, the per-cycle work is the same:

  1. Read inputs (from the testbench).
  2. Propagate combinational logic in topological order.
  3. Snapshot register inputs at the clock edge.
  4. Commit registers (D → Q).
  5. Re-propagate logic affected by the new register values.
  6. Emit outputs.

The CPU does this in one compiled function. The GPU does it in thousands of coordinated thread blocks.
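The six steps, as a toy Python cycle function (a sketch of the schedule, not arc's generated code) for a design with one register q and combinational logic d = (q + inp) mod 16:

```python
# One simulated cycle of a toy design, following the six steps above.
def cycle(q, inp):
    # 1. read inputs (inp comes from the testbench)
    d = (q + inp) % 16         # 2. propagate combinational logic
    snapshot = d               # 3. snapshot register input at the clock edge
    q = snapshot               # 4. commit register (D -> Q)
    d = (q + inp) % 16         # 5. re-propagate logic from the new Q
    return q, d                # 6. emit outputs

q = 0
for inp in [3, 3, 3]:
    q, out = cycle(q, inp)
# q has accumulated 3 three times: q == 9, and out looks one add ahead: 12
```

The snapshot/commit split is what keeps all registers updating in lockstep: no register sees another register's new value until after the edge.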

07

Results

These results are generated by running arc on open-source RISC-V CPUs (vexriscv at 44K gates, vexriscv_min at 12K).

ABOUT THE DESIGN UNDER TEST

VexRiscv is a popular open-source 32-bit RISC-V CPU. It's widely used in FPGA and ASIC projects, and ships in two main configurations: vexriscv_min, a minimal pipeline at roughly 12,000 gates, and the full vexriscv with caches, branch prediction, and a multiplier, at roughly 44,000 gates. Both are real, production-quality cores, and both run on arc.

Single-stim, CPU mode
one test stream, one CPU thread
Design                     Verilator   arc        Ratio
vexriscv_min (12K gates)   1.1 MHz     3.5 MHz    arc 3.2×
vexriscv (44K gates)       1.25 MHz    0.67 MHz   Verilator 1.9×

arc beats Verilator on smaller designs; on the largest design, Verilator's decade of optimisation work still wins. We're working to close that gap, then exceed it.

Multistim, CPU mode
vexriscv_min at N parallel test streams on the host CPU via rayon
N stimuli   arc aggregate   Verilator (1 thread)   vs 1 thread
N = 1       3.54 MHz        1.62 MHz               2.2×
N = 2       6.77 MHz        1.63 MHz               4.1×
N = 4       13.30 MHz       1.61 MHz               8.2×

N=1 is the single-stim baseline; N>1 is rayon-threaded multistim. The per-thread efficiency is high — N=4 lands at ~3.8× the N=1 aggregate (95% scaling).

Multistim, GPU mode (H100)
full vexriscv at N = 1024 stimuli on a Modal H100, vs single-threaded and multi-threaded Verilator
N stimuli   arc aggregate   Verilator (1 thread)   Verilator (multi-threaded)   vs 1 thread   vs multi-threaded
N = 1024    3.85 MHz        290 kHz                1.27 MHz                     13.3×         3.04×

For a chip team running a 1000-stimulus regression overnight, this is the difference between an hours-long wait and a coffee break.