Improve interpreter for 43% boot time reduction #126
Merged
This rewrites the interpreter hot path with computed-goto dispatch, TLB precomputation, and system-level optimizations.

Measured on Threadripper 2990WX (GCC 14.2, -O2, pinned core):
- Master: 3269ms avg (3248-3295ms)
- This: 1863ms avg (1857-1870ms), -43.0%

Hardware counter comparison:
- Instructions: 26.86B -> 13.95B (-48.1%)
- Cycles: 13.12B -> 7.53B (-42.6%)
- Branches: 5.40B -> 2.19B (-59.4%)
- Branch misses: 175M -> 121M (-30.6%)
- IPC: 2.05 -> 1.85

Interpreter changes (riscv.c):
- Computed-goto dispatch with dense 128-entry table
- Per-funct3 sub-dispatch tables for OP_IMM and OP
- Amortized interrupt checking (sequential path skips checks)
- Block chaining: inline icache lookup at branch/jump targets
- TLB data_minus_addr precomputation (single ADD on hit)
- MRU-first open-coded TLB probe with xor-fold hash
- Inlined JAL/JALR/BRANCH handlers (no function calls)
- Conditional MMU flush on trap/sret (skip same-privilege)
- __attribute__((hot, flatten)) on vm_step_many
- __attribute__((cold)) on exception/trap paths

Memory subsystem (riscv.h, riscv.c):
- hart_t struct reorder: hot fields at offset 0, cold caches at end
- 32-set x 2-way load/store TLB (up from 8 sets)
- 16-entry fetch TLB (up from 2 entries)
- Sequential fetch: block-boundary check replaces pointer compare

System-level (main.c, uart.c, device.h):
- UART output buffering (128-byte buffer, flush on newline)
- 512-step execution chunks (up from 128)
- SBI IPI: only set targeted harts, never clear pending
- SBI RFENCE.I: implemented via vm_fence_i
- Hart mask bounds checking in IPI/RFENCE loops

Correctness:
- Single-hart: boots Linux 6.12 to login
- SMP=4: all 4 CPUs come online
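For readers unfamiliar with the data_minus_addr trick, here is a minimal sketch of how a 2-way, MRU-first TLB can turn a hit into a tag compare plus one ADD. It is not the code from riscv.c: the type names, the `tlb_fill`/`tlb_lookup` helpers, the 32-bit guest address, and the particular xor-fold in `tlb_index` are all assumptions chosen for illustration. Only the 32-set x 2-way shape and the field name come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((uint32_t)((1u << PAGE_SHIFT) - 1))
#define TLB_SETS   32  /* set count from the PR; the rest is hypothetical */

/* Hypothetical entry: the tag is the guest virtual page number, and
 * data_minus_addr caches (host page address) - (guest page base). */
typedef struct {
    uint32_t  vpn;
    int       valid;
    uintptr_t data_minus_addr;
} tlb_entry_t;

typedef struct {
    tlb_entry_t way[2];  /* way[0] holds the most recently used entry */
} tlb_set_t;

/* One plausible xor-fold: mix higher VPN bits into the set index. */
static inline unsigned tlb_index(uint32_t vpn)
{
    return (vpn ^ (vpn >> 5)) & (TLB_SETS - 1);
}

/* Fill: the subtraction happens once, when the translation is installed. */
static void tlb_fill(tlb_set_t *tlb, uint32_t vaddr, uint8_t *host_page)
{
    assert(host_page != NULL);
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    tlb_set_t *set = &tlb[tlb_index(vpn)];

    set->way[1] = set->way[0];  /* demote the previous MRU entry */
    set->way[0].vpn = vpn;
    set->way[0].valid = 1;
    set->way[0].data_minus_addr =
        (uintptr_t)host_page - (uintptr_t)(vaddr & ~PAGE_MASK);
}

/* Lookup: returns the host pointer for vaddr, or NULL on a miss. */
static uint8_t *tlb_lookup(tlb_set_t *tlb, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    tlb_set_t *set = &tlb[tlb_index(vpn)];

    /* Probe the MRU way first; a hit costs one compare and one ADD. */
    if (set->way[0].valid && set->way[0].vpn == vpn)
        return (uint8_t *)(set->way[0].data_minus_addr + vaddr);

    if (set->way[1].valid && set->way[1].vpn == vpn) {
        tlb_entry_t hit = set->way[1];
        set->way[1] = set->way[0];  /* promote the hit to MRU */
        set->way[0] = hit;
        return (uint8_t *)(hit.data_minus_addr + vaddr);
    }
    return NULL;  /* caller falls back to a full page-table walk */
}
```

Because the cached value already has the guest page base subtracted out, the hit path needs no masking of the page offset: adding the full guest address lands on the right host byte. The unsigned arithmetic wraps, so this works even when the host page sits below the guest address numerically.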
Summary by cubic
Rewrite the RISC‑V interpreter hot path to cut boot time by ~43% and reduce instruction/branch counts. Adds buffered UART output with prompt-aware flushing and proper SBI RFENCE.I handling.
vm_fence_i; added bench-login make target.
Written for commit 31ed456.
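The UART output buffering mentioned above (128-byte buffer, flush on newline) could look roughly like the sketch below. This is an illustration under assumptions, not the code in uart.c: `uart_out_t`, `uart_putc`, and `uart_flush` are hypothetical names, and the output goes to a plain POSIX file descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define UART_BUF_SIZE 128  /* buffer size stated in the PR description */

/* Hypothetical output state; the real device struct may differ. */
typedef struct {
    int    fd;                  /* host descriptor that receives guest output */
    size_t len;                 /* number of buffered bytes */
    char   buf[UART_BUF_SIZE];
} uart_out_t;

/* Write out everything buffered so far, tolerating short writes. */
static void uart_flush(uart_out_t *u)
{
    size_t off = 0;
    while (off < u->len) {
        ssize_t n = write(u->fd, u->buf + off, u->len - off);
        if (n <= 0)
            break;  /* sketch only: drop the rest on error */
        off += (size_t)n;
    }
    u->len = 0;
}

/* Buffer one byte; flush on newline or when the buffer fills up. */
static void uart_putc(uart_out_t *u, char c)
{
    assert(u->len < UART_BUF_SIZE);
    u->buf[u->len++] = c;
    if (c == '\n' || u->len == UART_BUF_SIZE)
        uart_flush(u);
}
```

A newline-only policy would leave a prompt such as `login: ` sitting in the buffer, which is presumably why the summary mentions prompt-aware flushing. In a sketch like this, the caller would also invoke `uart_flush` at some other point, for example before the guest blocks waiting for input.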