Improve interpreter for 43% boot time reduction by jserv · Pull Request #126 · sysprog21/semu

jserv · 2026-04-24T08:38:32Z

This rewrites the interpreter hot path with computed-goto dispatch, TLB precomputation, and system-level optimizations. Measured on Threadripper 2990WX (GCC 14.2, -O2, pinned core):

  Master: 3269ms avg (3248-3295ms)
  This:   1863ms avg (1857-1870ms)  -43.0%

Hardware counter comparison:

  Instructions:   26.86B -> 13.95B  (-48.1%)
  Cycles:         13.12B ->  7.53B  (-42.6%)
  Branches:        5.40B ->  2.19B  (-59.4%)
  Branch misses:    175M ->   121M  (-30.6%)
  IPC:             2.05  ->  1.85
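
The lower IPC is consistent with the speedup rather than at odds with it: wall time follows cycles, and cycles fall by about the same fraction as boot time. A small sketch that recomputes the derived figures from the raw counters above (the helper names are illustrative and not part of semu):

```c
/* Relative change in percent; a negative result means a reduction. */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Instructions retired per cycle. */
static double ipc(double instructions, double cycles)
{
    return instructions / cycles;
}
```

With the counters in billions, `ipc(26.86, 13.12)` is about 2.05 and `ipc(13.95, 7.53)` is about 1.85, while `pct_change(13.12, 7.53)` is about -42.6.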

Interpreter changes (riscv.c):

- Computed-goto dispatch with dense 128-entry table
- Per-funct3 sub-dispatch tables for OP_IMM and OP
- Amortized interrupt checking (sequential path skips checks)
- Block chaining: inline icache lookup at branch/jump targets
- TLB data_minus_addr precomputation (single ADD on hit)
- MRU-first open-coded TLB probe with xor-fold hash
- Inlined JAL/JALR/BRANCH handlers (no function calls)
- Conditional MMU flush on trap/sret (skip same-privilege)
- __attribute__((hot, flatten)) on vm_step_many
- __attribute__((cold)) on exception/trap paths
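
For readers unfamiliar with computed-goto dispatch, here is a minimal sketch of the technique on a toy bytecode. It is not semu's decoder: the opcodes, the 4-entry table, and `toy_run` are hypothetical. The 128-entry table in this PR would correspond to the 7-bit RISC-V opcode field. The sketch relies on the GCC/Clang "labels as values" extension.

```c
#include <stdint.h>

/* Hypothetical toy opcodes; not the RISC-V encoding and not semu's table. */
enum { OP_HALT = 0, OP_INC = 1, OP_DEC = 2, OP_DOUBLE = 3, OP_COUNT = 4 };

/* Runs a byte-coded program and returns the accumulator. */
static int64_t toy_run(const uint8_t *code)
{
    /* Dense table: one label address per opcode value, so dispatch is an
     * indexed load followed by an indirect jump. */
    static void *const dispatch[OP_COUNT] = {
        [OP_HALT] = &&do_halt,
        [OP_INC] = &&do_inc,
        [OP_DEC] = &&do_dec,
        [OP_DOUBLE] = &&do_double,
    };
    int64_t acc = 0;

/* The mask keeps the index inside the table (OP_COUNT is a power of two). */
#define DISPATCH() goto *dispatch[*code++ & (OP_COUNT - 1)]

    DISPATCH();
do_inc:
    acc += 1;
    DISPATCH();
do_dec:
    acc -= 1;
    DISPATCH();
do_double:
    acc *= 2;
    DISPATCH();
do_halt:
    return acc;
#undef DISPATCH
}
```

Each handler ends with its own indirect jump instead of returning to a shared `switch`, which gives the branch predictor one jump site per handler. Running the program `{OP_INC, OP_INC, OP_DOUBLE, OP_DEC, OP_HALT}` returns 3.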

Memory subsystem (riscv.h, riscv.c):

- hart_t struct reorder: hot fields at offset 0, cold caches at end
- 32-set x 2-way load/store TLB (up from 8 sets)
- 16-entry fetch TLB (up from 2 entries)
- Sequential fetch: block-boundary check replaces pointer compare
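
The TLB items can be illustrated with a small sketch. It shows one plausible reading of `data_minus_addr`: each entry caches the host page address minus the guest page address, so a hit turns a guest address into a host pointer with one addition. The struct layout, the hash shift, and the function names are assumptions, and the probe is written as a loop for brevity where the PR open-codes it.

```c
#include <stddef.h>
#include <stdint.h>

#define TLB_SETS 32u /* sets, as in the list above */
#define TLB_WAYS 2u  /* ways per set */
#define PAGE_SHIFT 12u

typedef struct {
    uint32_t vpn;              /* guest virtual page number (tag) */
    uint32_t valid;
    uintptr_t data_minus_addr; /* host page base minus guest page base */
} tlb_entry_t;

typedef struct {
    tlb_entry_t way[TLB_WAYS];
    uint32_t mru; /* most recently used way, probed first */
} tlb_set_t;

typedef struct {
    tlb_set_t set[TLB_SETS];
} tlb_t;

/* xor-fold: mix upper page-number bits into the set index. */
static inline uint32_t tlb_hash(uint32_t vpn)
{
    return (vpn ^ (vpn >> 5)) & (TLB_SETS - 1);
}

/* Install a translation, replacing the way that is not the MRU one. */
static void tlb_fill(tlb_t *tlb, uint32_t vaddr, uint8_t *host_page)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    tlb_set_t *s = &tlb->set[tlb_hash(vpn)];
    uint32_t victim = s->mru ^ 1u;
    s->way[victim].vpn = vpn;
    s->way[victim].valid = 1;
    s->way[victim].data_minus_addr =
        (uintptr_t) host_page - (uintptr_t) (vpn << PAGE_SHIFT);
    s->mru = victim;
}

/* Returns the host pointer for vaddr, or NULL on a miss. */
static uint8_t *tlb_lookup(tlb_t *tlb, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    tlb_set_t *s = &tlb->set[tlb_hash(vpn)];
    for (uint32_t i = 0; i < TLB_WAYS; i++) {
        uint32_t w = s->mru ^ i; /* MRU way first, then the other */
        tlb_entry_t *e = &s->way[w];
        if (e->valid && e->vpn == vpn) {
            s->mru = w;
            return (uint8_t *) (e->data_minus_addr + vaddr); /* single add */
        }
    }
    return NULL;
}
```

Storing the difference rather than the host base removes the mask-and-add of the page offset from the hit path; unsigned wraparound in `uintptr_t` makes the subtraction and the later addition cancel exactly.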

System-level (main.c, uart.c, device.h):

- UART output buffering (128-byte buffer, flush on newline)
- 512-step execution chunks (up from 128)
- SBI IPI: only set targeted harts, never clear pending
- SBI RFENCE.I: implemented via vm_fence_i
- Hart mask bounds checking in IPI/RFENCE loops
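
A minimal sketch of the UART output buffering idea, assuming a POSIX file descriptor as the sink. The type and function names are hypothetical and do not come from uart.c; only the 128-byte size and the flush-on-newline rule are taken from the list above.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define UART_OUT_BUF_SIZE 128

typedef struct {
    int fd;     /* host file descriptor receiving guest output */
    size_t len; /* bytes currently buffered */
    uint8_t buf[UART_OUT_BUF_SIZE];
} uart_out_t;

/* Write out everything buffered so far. */
static void uart_out_flush(uart_out_t *u)
{
    size_t off = 0;
    while (off < u->len) {
        ssize_t n = write(u->fd, u->buf + off, u->len - off);
        if (n <= 0)
            break; /* sketch only: drop the remainder on error */
        off += (size_t) n;
    }
    u->len = 0;
}

/* Queue one byte; flush on newline or when the buffer is full. */
static void uart_out_putc(uart_out_t *u, uint8_t c)
{
    u->buf[u->len++] = c;
    if (c == '\n' || u->len == UART_OUT_BUF_SIZE)
        uart_out_flush(u);
}
```

Batching turns one `write` system call per character into roughly one per line, which matters during boot when the kernel log is the dominant output.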

Correctness:

- Single-hart: boots Linux 6.12 to login
- SMP=4: all 4 CPUs come online

Summary by cubic

Rewrite the RISC‑V interpreter hot path to cut boot time by ~43% and reduce instruction/branch counts. Adds buffered UART output with prompt-aware flushing and proper SBI RFENCE.I handling.

Refactors
- Computed-goto dispatch with small sub-dispatch tables; inlined JAL/JALR/BRANCH; block chaining via inline I‑cache lookup.
- MMU/TLB: load/store TLB enlarged to 32 sets × 2 ways and fetch TLB to 16 entries; MRU-first probe; precomputed data_minus_addr with direct host RAM fast path.
- Faster sequential fetch (block-boundary check) and amortized interrupt checks; conditional MMU flush on trap/sret; I‑cache epoch-based invalidation for FENCE.I.
- System: UART output buffering with flush on newline/prompt and before input; periodic flush in tick; single-hart step chunk 512; SBI IPI only sets targeted harts; bounds-checked hart masks in IPI/RFENCE loops; SBI RFENCE.I via vm_fence_i; added bench-login make target.
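
The IPI handling described in the last bullet can be sketched as follows, assuming the SBI convention of a `hart_mask` bit-vector offset by `hart_mask_base`, with a base of -1 meaning all harts. The 32-bit mask width, `N_HARTS`, `toy_vm_t`, and `toy_send_ipi` are illustrative assumptions rather than semu's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

#define N_HARTS 4u /* hypothetical hart count */

typedef struct {
    bool ipi_pending[N_HARTS];
} toy_vm_t;

/* Mark an IPI pending on every targeted hart. Harts outside the mask are
 * left untouched, so an IPI that is already pending is never cleared. */
static void toy_send_ipi(toy_vm_t *vm, uint32_t hart_mask,
                         uint32_t hart_mask_base)
{
    if (hart_mask_base == UINT32_MAX) { /* base of -1 selects all harts */
        for (uint32_t h = 0; h < N_HARTS; h++)
            vm->ipi_pending[h] = true;
        return;
    }
    for (uint32_t bit = 0; bit < 32; bit++) {
        if (!(hart_mask & (1u << bit)))
            continue;
        /* 64-bit sum so base + bit cannot wrap before the bounds check. */
        uint64_t hart = (uint64_t) hart_mask_base + bit;
        if (hart >= N_HARTS)
            continue; /* ignore bits that name nonexistent harts */
        vm->ipi_pending[hart] = true;
    }
}
```

With four harts, a call such as `toy_send_ipi(&vm, 0x5, 2)` targets harts 2 and 4; only hart 2 is marked, and the out-of-range bit is skipped instead of indexing past the array.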

Written for commit 31ed456. Summary will update on new commits.


jserv force-pushed the fast-interpreter branch from 7a947f2 to 769d6e1 on April 24, 2026 08:51

jserv force-pushed the fast-interpreter branch from 769d6e1 to 31ed456 on April 24, 2026 09:11

jserv merged commit 3c2aaaa into master Apr 24, 2026
10 checks passed

jserv deleted the fast-interpreter branch April 24, 2026 09:13

jserv merged 1 commit into master from fast-interpreter
