What we build, and why
A chip is a giant logic puzzle
A digital chip is a network of tiny switches (transistors) wired into AND / OR / NOT gates. Those gates are wired into adders, registers, decoders, caches — and ultimately into a complete machine like a CPU.
You can think of a chip as a spreadsheet with millions of cells. Each cell holds a 0 or a 1, and some cells depend on others through simple rules. On every clock tick, every cell recomputes its value from its neighbours, in lockstep.
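To make the gate picture concrete, here is a hypothetical Python sketch (not anything from arc): a 1-bit full adder built only from AND, OR, and NOT, the same primitives a chip's logic bottoms out in.

```python
# Three primitive gates, operating on 0/1 values.
AND = lambda a, b: a & b
OR = lambda a, b: a | b
NOT = lambda a: 1 - a

def xor(a, b):
    # XOR expressed with the primitives: (a AND NOT b) OR (NOT a AND b)
    return OR(AND(a, NOT(b)), AND(NOT(a), b))

def full_adder(a, b, cin):
    # Sum and carry-out of a 1-bit addition, built from gates only.
    s = xor(xor(a, b), cin)
    cout = OR(AND(a, b), AND(cin, xor(a, b)))
    return s, cout

print(full_adder(1, 1, 0))  # (0, 1): 1 + 1 = binary 10
```

Wire enough of these together and you have an ALU; keep going and you have a CPU.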
Why simulate?
Before fabrication, designers verify the chip exhaustively in software. They describe the chip in Verilog (or SystemVerilog) — and a simulator runs it for billions of clock ticks against test cases, checking outputs against expected behaviour.
A bug that surfaces after 10 billion ticks takes 28 hours to reach at 100 kHz, or 17 minutes at 10 MHz.
Faster simulator → faster iteration → ship hardware features faster.
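The arithmetic behind those numbers:

```python
ticks = 10_000_000_000  # cycles to reach the bug

hours_at_100khz = ticks / 100e3 / 3600
minutes_at_10mhz = ticks / 10e6 / 60

print(round(hours_at_100khz, 1))   # 27.8 hours
print(round(minutes_at_10mhz, 1))  # 16.7 minutes
```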
The state of the art
The fastest simulator is Verilator (open-source). It translates Verilog into C++, compiles it, and runs it on the CPU. On vexriscv, a RISC-V CPU of ~44,000 gates, it runs at ~1.25 MHz on a Mac. It's missing important features of commercial simulators, particularly for verification, but we'll use it as the benchmark for speed.
Commercial tools (Synopsys VCS, Cadence Xcelium — ~$500K/seat/year) are fully featured, but slower.
Verilator’s the bar for speed.
Where our simulator (arc) is different
Two ideas.
Verification is parallel, not serial
A CPU simulator runs one test stream at a time, but a real verification campaign is thousands of tests: random programs, fuzzed inputs, property checks, regression suites.
This is exactly what GPUs are built for: running the same computation on many inputs at once. arc's GPU and threaded modes run hundreds of test streams in parallel, each evaluating the same chip with different stimulus. We call this multistim mode.
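arc's actual multistim machinery isn't shown here, but the core data-parallel trick can be sketched with bit-packing: put one test stream's value of a signal in each bit of a machine word, and a single bitwise operation then evaluates a gate for 64 streams at once. A hypothetical Python sketch:

```python
# One 64-bit word carries a signal's value for 64 test streams:
# bit i is stream i's value. One bitwise op evaluates a gate for
# all 64 streams simultaneously.
MASK = (1 << 64) - 1

def and_gate(a: int, b: int) -> int:
    return a & b

def not_gate(a: int) -> int:
    return ~a & MASK

a = 0b1100  # streams 2 and 3 drive a = 1; all others a = 0
b = 0b1010  # streams 1 and 3 drive b = 1
y = not_gate(and_gate(a, b))  # NAND evaluated across all 64 streams

print((y >> 3) & 1)  # 0: only stream 3 had a = b = 1
print((y >> 2) & 1)  # 1
```

GPU threads and SIMD lanes widen the same idea further: the gate evaluation is identical per stream, only the stimulus differs.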
Normalise to AIG so the GPU mapping works
To run on a GPU, the chip's per-cycle logic has to fit into limited GPU register memory (~64 registers per thread). RTL is too big and too irregular, so arc reduces it to an and-inverter graph (AIG): every signal becomes either AND(a, b) or NOT(a).
Every signal is the same primitive now. The graph can be partitioned cleanly (uniform-cost nodes), vectorised, and mapped onto GPU thread blocks. The AIG is the bridge to the GPU.
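A minimal sketch (illustrative only, not arc's data structures) of why uniform AIG nodes are easy to evaluate: with nodes in topological order, one forward pass computes every signal.

```python
# Each node is ("IN", name), ("NOT", i), or ("AND", i, j), where i and j
# index earlier nodes. Topological order means one forward pass suffices.
def eval_aig(nodes, inputs):
    val = []
    for node in nodes:
        if node[0] == "IN":
            val.append(inputs[node[1]])
        elif node[0] == "NOT":
            val.append(1 - val[node[1]])
        else:  # "AND"
            val.append(val[node[1]] & val[node[2]])
    return val

# XOR(a, b) lowered to AND/NOT only:
# xor = NOT(AND(NOT(AND(a, NOT b)), NOT(AND(NOT a, b))))
xor_aig = [
    ("IN", "a"),    # 0
    ("IN", "b"),    # 1
    ("NOT", 1),     # 2: !b
    ("AND", 0, 2),  # 3: a & !b
    ("NOT", 0),     # 4: !a
    ("AND", 4, 1),  # 5: !a & b
    ("NOT", 3),     # 6
    ("NOT", 5),     # 7
    ("AND", 6, 7),  # 8
    ("NOT", 8),     # 9: xor output
]
print(eval_aig(xor_aig, {"a": 1, "b": 0})[-1])  # 1
print(eval_aig(xor_aig, {"a": 1, "b": 1})[-1])  # 0
```

Because every node costs the same, the graph partitions into equal-sized chunks that map naturally onto thread blocks.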
What happens when you run arc
The frontend: slang
The first job is to read the SystemVerilog. This is harder than it sounds.
SystemVerilog has a preprocessor (macros, includes, conditional compilation), generate blocks (compile-time loops), parameterised modules (the same code instantiated with different bit widths), packages, classes, and a hierarchy of modules instantiating other modules.
arc uses slang — the only open-source parser that handles the full language. slang does two things:
- Parse — text → abstract syntax tree.
- Elaborate — resolve every module instantiation, expand every generate block, substitute every parameter value, fold every constant. The result is a flat tree of concrete modules with no remaining abstraction.
Out the other side comes a fully-elaborated design with every signal width and every constant fixed. That’s the starting point for everything arc does.
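Elaboration itself is slang's job; as a loose analogy (not slang's API), here is what "substitute every parameter and fold every constant" means for a width-parameterised module, sketched in Python with hypothetical names:

```python
# "Elaboration" pins every parameter and folds every constant. One
# parameterised description yields concrete fixed-width instances with
# nothing symbolic left inside.
def elaborate_adder(width):
    mask = (1 << width) - 1  # folded to a constant at elaboration time
    def adder(a, b):         # the flat, concrete instance
        return (a + b) & mask
    return adder

add8 = elaborate_adder(8)    # same source, WIDTH = 8
add16 = elaborate_adder(16)  # same source, WIDTH = 16
print(add8(200, 100))   # 44: wraps modulo 2**8
print(add16(200, 100))  # 300
```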
Inside one cycle
Whichever path arc takes, the per-cycle work is the same:
- Read inputs (from the testbench).
- Propagate combinational logic in topological order.
- Snapshot register inputs at the clock edge.
- Commit registers (D → Q).
- Re-propagate logic affected by the new register values.
- Emit outputs.
The CPU does this in one compiled function. The GPU does it in thousands of coordinated thread blocks.
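The six steps above can be sketched for a toy design (a hypothetical 2-bit counter, not arc's generated code) to show the two-phase register discipline:

```python
# One clock cycle: propagate combinational logic to get D, commit
# D -> Q at the clock edge, then the new Q is visible on the outputs.
def cycle(q, enable):
    d = (q + 1) % 4 if enable else q  # combinational next-state logic
    q = d                             # commit registers (D -> Q)
    return q                          # emit outputs

q = 0
for _ in range(5):
    q = cycle(q, enable=True)
print(q)  # 1: five ticks of a 2-bit counter, 5 mod 4
```

The snapshot-then-commit split matters: every register samples its D input from the *old* state before any register updates, which is exactly what real flip-flops do on a clock edge.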
Results
These results come from running arc on two open-source RISC-V CPUs: vexriscv (44K gates) and vexriscv_min (12K gates).
VexRiscv is a popular open-source 32-bit RISC-V CPU. It's widely used in FPGA and ASIC projects, and ships in two main configurations: vexriscv_min, a minimal pipeline at roughly 12,000 gates, and the full vexriscv with caches, branch prediction, and a multiplier at roughly 44,000 gates. Both are real, production-quality cores, and both run on arc.
| Design | Verilator | arc | Ratio |
|---|---|---|---|
| vexriscv_min (12K gates) | 1.1 MHz | 3.5 MHz | arc 3.2× |
| vexriscv (44K gates) | 1.25 MHz | 0.67 MHz | Verilator 1.9× |
arc beats Verilator on the smaller design; on the larger one, Verilator's decade of optimisation work still wins. We're working to close and then exceed that gap.
| N stimuli | arc aggregate | Verilator (1-thread) | vs 1-thread |
|---|---|---|---|
| N = 1 | 3.54 MHz | 1.62 MHz | 2.2× |
| N = 2 | 6.77 MHz | 1.63 MHz | 4.1× |
| N = 4 | 13.30 MHz | 1.61 MHz | 8.2× |
N = 1 is the single-stim baseline; N > 1 is rayon-threaded multistim. Per-thread efficiency is high: N = 4 lands at ~3.8× the N = 1 aggregate (~95% scaling).
| N stimuli | arc aggregate | Verilator (1-thread) | Verilator (multi-threaded) | vs 1-thread | vs multi-threaded |
|---|---|---|---|---|---|
| N = 1024 | 3.85 MHz | 290 kHz | 1.27 MHz | 13.3× | 3.04× |
For a chip team running a 1000-stimulus regression overnight, this is the difference between an hours-long wait and a coffee break.