Improve interpreter for 43% boot time reduction #126

Merged: jserv merged 1 commit into master from fast-interpreter on Apr 24, 2026

@jserv (Collaborator) commented Apr 24, 2026

This rewrites the interpreter hot path with computed-goto dispatch, TLB precomputation, and system-level optimizations. Measured on Threadripper 2990WX (GCC 14.2, -O2, pinned core):

  Master: 3269ms avg (3248-3295ms)
  This:   1863ms avg (1857-1870ms)  -43.0%

Hardware counter comparison:

  Instructions:   26.86B -> 13.95B  (-48.1%)
  Cycles:         13.12B ->  7.53B  (-42.6%)
  Branches:        5.40B ->  2.19B  (-59.4%)
  Branch misses:    175M ->   121M  (-30.6%)
  IPC:             2.05  ->  1.85
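
As a quick cross-check, the headline reduction and the IPC figures can be recomputed from the rounded values listed above (IPC is instructions divided by cycles):

```math
\begin{aligned}
\text{time reduction} &= \frac{3269 - 1863}{3269} \approx 43.0\% \\
\text{IPC}_{\text{master}} &= \frac{26.86\,\text{B}}{13.12\,\text{B}} \approx 2.05 \\
\text{IPC}_{\text{this}} &= \frac{13.95\,\text{B}}{7.53\,\text{B}} \approx 1.85
\end{aligned}
```

The lower IPC is outweighed by the roughly halved instruction count, which is why cycles still fall by 42.6%.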

Interpreter changes (riscv.c):

  • Computed-goto dispatch with dense 128-entry table
  • Per-funct3 sub-dispatch tables for OP_IMM and OP
  • Amortized interrupt checking (sequential path skips checks)
  • Block chaining: inline icache lookup at branch/jump targets
  • TLB data_minus_addr precomputation (single ADD on hit)
  • MRU-first open-coded TLB probe with xor-fold hash
  • Inlined JAL/JALR/BRANCH handlers (no function calls)
  • Conditional MMU flush on trap/sret (skip same-privilege)
  • __attribute__((hot, flatten)) on vm_step_many
  • __attribute__((cold)) on exception/trap paths
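
For readers unfamiliar with computed-goto dispatch, the idea behind the first bullet can be sketched as below. This is a toy, not the code in riscv.c: `toy_hart_t`, `toy_run`, and the `NEXT` macro are hypothetical names, only ADDI and ADD are modelled, and the per-funct3 sub-dispatch tables are replaced by plain checks to keep the example short. It relies on the GNU "labels as values" extension (GCC or Clang).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy state; the real hart_t in riscv.h holds far more. */
typedef struct {
    uint32_t x[32]; /* integer registers, x0 stays zero */
} toy_hart_t;

/* Executes n instructions from code[]. Returns 0 on success and -1 on an
 * encoding this toy does not model. */
static int toy_run(toy_hart_t *h, const uint32_t *code, size_t n)
{
    /* Dense table indexed by the 7-bit major opcode: every slot is filled,
     * so the dispatch needs no range check. */
    static void *const dispatch[128] = {
        [0x00 ... 0x12] = &&op_illegal,
        [0x13] = &&op_imm, /* OP_IMM */
        [0x14 ... 0x32] = &&op_illegal,
        [0x33] = &&op_reg, /* OP */
        [0x34 ... 0x7f] = &&op_illegal,
    };
    size_t pc = 0; /* index into code[], not a guest address */
    uint32_t insn = 0;

/* Fetch the next instruction and jump straight to its handler. */
#define NEXT()                       \
    do {                             \
        if (pc == n)                 \
            return 0;                \
        insn = code[pc++];           \
        goto *dispatch[insn & 0x7f]; \
    } while (0)

    NEXT();

op_imm: {
    uint32_t rd = (insn >> 7) & 31, rs1 = (insn >> 15) & 31;
    int32_t imm = (int32_t) insn >> 20; /* sign-extended I-type immediate */
    if (((insn >> 12) & 7) != 0) /* only ADDI (funct3 == 0) is modelled */
        goto op_illegal;
    if (rd)
        h->x[rd] = h->x[rs1] + (uint32_t) imm;
    NEXT();
}
op_reg: {
    uint32_t rd = (insn >> 7) & 31, rs1 = (insn >> 15) & 31;
    uint32_t rs2 = (insn >> 20) & 31;
    if (((insn >> 12) & 7) != 0 || (insn >> 25) != 0) /* only ADD */
        goto op_illegal;
    if (rd)
        h->x[rd] = h->x[rs1] + h->x[rs2];
    NEXT();
}
op_illegal:
    return -1;
#undef NEXT
}
```

Each handler ends with its own indirect jump instead of returning to a shared `switch`, which gives the branch predictor one prediction site per handler; that is the usual motivation for this style of dispatch.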

Memory subsystem (riscv.h, riscv.c):

  • hart_t struct reorder: hot fields at offset 0, cold caches at end
  • 32-set x 2-way load/store TLB (up from 8 sets)
  • 16-entry fetch TLB (up from 2 entries)
  • Sequential fetch: block-boundary check replaces pointer compare
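
A minimal sketch of how a 32-set, 2-way TLB can combine the data_minus_addr precomputation with an MRU-first probe is shown below. All names (`toy_tlb_t`, `toy_tlb_insert`, `toy_tlb_lookup`) are hypothetical, the xor-fold in `toy_tlb_index` is a guess at the general shape rather than the hash used in riscv.c, and permission bits are omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_TLB_SETS 32
#define TOY_PAGE_SHIFT 12

typedef struct {
    uint32_t vpn;              /* guest virtual page number (tag) */
    uint8_t valid;
    uintptr_t data_minus_addr; /* host page pointer minus guest page base */
} toy_tlb_entry_t;

typedef struct {
    toy_tlb_entry_t way[2];
    uint8_t mru; /* index of the most recently used way */
} toy_tlb_set_t;

typedef struct {
    toy_tlb_set_t set[TOY_TLB_SETS];
} toy_tlb_t;

/* Assumed xor-fold: mixes higher VPN bits into the 5-bit set index. */
static inline uint32_t toy_tlb_index(uint32_t vpn)
{
    return (vpn ^ (vpn >> 5) ^ (vpn >> 10)) & (TOY_TLB_SETS - 1);
}

static void toy_tlb_insert(toy_tlb_t *t, uint32_t vaddr, uint8_t *host_page)
{
    uint32_t vpn = vaddr >> TOY_PAGE_SHIFT;
    toy_tlb_set_t *s = &t->set[toy_tlb_index(vpn)];
    unsigned victim = s->mru ^ 1u; /* evict the way that is not MRU */
    s->way[victim].vpn = vpn;
    s->way[victim].valid = 1;
    /* Precompute the difference once, at fill time. */
    s->way[victim].data_minus_addr =
        (uintptr_t) host_page - (uintptr_t) (vpn << TOY_PAGE_SHIFT);
    s->mru = (uint8_t) victim;
}

/* Returns the host pointer for vaddr, or NULL on a TLB miss. */
static uint8_t *toy_tlb_lookup(toy_tlb_t *t, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> TOY_PAGE_SHIFT;
    toy_tlb_set_t *s = &t->set[toy_tlb_index(vpn)];
    unsigned w = s->mru; /* probe the MRU way first */
    if (!(s->way[w].valid && s->way[w].vpn == vpn)) {
        w ^= 1u;
        if (!(s->way[w].valid && s->way[w].vpn == vpn))
            return NULL;
        s->mru = (uint8_t) w;
    }
    /* Hit path: a single add turns the guest address into a host pointer. */
    return (uint8_t *) (s->way[w].data_minus_addr + vaddr);
}
```

Because the subtraction is done in `uintptr_t`, it may wrap; the later addition wraps back, so the result is the host page pointer plus the page offset.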

System-level (main.c, uart.c, device.h):

  • UART output buffering (128-byte buffer, flush on newline)
  • 512-step execution chunks (up from 128)
  • SBI IPI: only set targeted harts, never clear pending
  • SBI RFENCE.I: implemented via vm_fence_i
  • Hart mask bounds checking in IPI/RFENCE loops
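
The UART buffering item can be pictured with the sketch below: bytes accumulate in a 128-byte buffer and are written out on a newline or when the buffer fills. The names `toy_uart_out_t`, `toy_uart_putc`, and `toy_uart_flush` are hypothetical and the error handling is deliberately simplistic; the code in uart.c may differ.

```c
#include <stddef.h>
#include <unistd.h>

#define TOY_UART_BUF_SIZE 128

typedef struct {
    int fd;     /* host file descriptor receiving guest output */
    size_t len; /* number of buffered bytes */
    char buf[TOY_UART_BUF_SIZE];
} toy_uart_out_t;

/* Writes out everything buffered so far, looping over short writes. */
static void toy_uart_flush(toy_uart_out_t *u)
{
    size_t off = 0;
    while (off < u->len) {
        ssize_t n = write(u->fd, u->buf + off, u->len - off);
        if (n <= 0)
            break; /* sketch only: drop the remainder on error */
        off += (size_t) n;
    }
    u->len = 0;
}

/* Buffers one byte; flushes on newline or when the buffer is full. */
static void toy_uart_putc(toy_uart_out_t *u, char c)
{
    u->buf[u->len++] = c;
    if (c == '\n' || u->len == TOY_UART_BUF_SIZE)
        toy_uart_flush(u);
}
```

Batching turns one `write` system call per character into roughly one per line, at the cost of needing extra flush points for output that does not end in a newline, such as a login prompt.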

Correctness:

  • Single-hart: boots Linux 6.12 to login
  • SMP=4: all 4 CPUs come online

Summary by cubic

Rewrite the RISC‑V interpreter hot path to cut boot time by ~43% and reduce instruction/branch counts. Adds buffered UART output with prompt-aware flushing and proper SBI RFENCE.I handling.

  • Refactors
    • Computed-goto dispatch with small sub-dispatch tables; inlined JAL/JALR/BRANCH; block chaining via inline I‑cache lookup.
    • MMU/TLB: load/store 32×2 sets and fetch TLB 16 entries; MRU-first probe; precomputed data_minus_addr with direct host RAM fast path.
    • Faster sequential fetch (block-boundary check) and amortized interrupt checks; conditional MMU flush on trap/sret; I‑cache epoch-based invalidation for FENCE.I.
    • System: UART output buffering with flush on newline/prompt and before input; periodic flush in tick; single-hart step chunk 512; SBI IPI only sets targeted harts; bounds-checked hart masks in IPI/RFENCE loops; SBI RFENCE.I via vm_fence_i; added bench-login make target.
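
One way to bounds-check an SBI-style (hart_mask, hart_mask_base) pair is sketched below. `toy_select_harts` is a hypothetical helper; it assumes at most 32 harts and silently skips out-of-range hart IDs, whereas the PR does not say whether its loops skip such IDs or report an error.

```c
#include <stdint.h>

/* Returns a bitmap of valid hart indices selected by (hart_mask,
 * hart_mask_base). Assumes n_harts <= 32. A base of all ones (-1) selects
 * every hart, following the SBI calling convention. */
static uint32_t toy_select_harts(uint32_t hart_mask,
                                 uint32_t hart_mask_base,
                                 uint32_t n_harts)
{
    uint32_t selected = 0;

    if (hart_mask_base == UINT32_MAX) {
        for (uint32_t i = 0; i < n_harts; i++)
            selected |= UINT32_C(1) << i;
        return selected;
    }

    for (uint32_t bit = 0; bit < 32; bit++) {
        if (!(hart_mask & (UINT32_C(1) << bit)))
            continue;
        /* Widen before adding so base + bit cannot wrap around. */
        uint64_t id = (uint64_t) hart_mask_base + bit;
        if (id >= n_harts)
            continue; /* out of range: never index past the hart array */
        selected |= UINT32_C(1) << id;
    }
    return selected;
}
```

For an IPI, the caller would then set the pending bit only for harts present in the returned bitmap and leave every other hart's pending state untouched.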

Written for commit 31ed456. Summary will update on new commits.

cubic-dev-ai[bot]

This comment was marked as resolved.

@jserv force-pushed the fast-interpreter branch from 7a947f2 to 769d6e1 on April 24, 2026 08:51
@jserv force-pushed the fast-interpreter branch from 769d6e1 to 31ed456 on April 24, 2026 09:11
@jserv merged commit 3c2aaaa into master on Apr 24, 2026
10 checks passed
@jserv deleted the fast-interpreter branch on April 24, 2026 09:13