Introduce execution fast paths#124

Merged
jserv merged 1 commit into master from performance
Apr 23, 2026
Conversation

@jserv (Collaborator) commented Apr 23, 2026

This adds multi-level caching and batched execution to reduce per-instruction overhead in the interpreter loop:

Instruction decode and fetch:

  • Pre-decoded instruction struct (decoded_insn_t) packs register indices and opcode fields into a single word, avoiding repeated bit extraction on hot paths (CSR, AMO, SYSTEM opcodes).
  • Sequential fetch fast-path: track the current I-cache block pointer so consecutive PC+4 fetches read directly from the block without re-entering mmu_fetch.
  • vm_step_many() runs up to N instructions in a tight loop, reading sequentially from the I-cache block via a local pointer (seq_ptr) and falling back to mmu_fetch only on block/page boundaries and taken branches.
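The pre-decoded form can be sketched roughly as follows; the field layout and helper names here are illustrative only, not the actual semu structures, but they show the idea of paying the bit extraction once at decode time:

```c
#include <stdint.h>

/* Illustrative decoded_insn_t: opcode and register indices are
 * extracted once and packed into a single 32-bit word, so hot
 * handlers read plain bytes instead of re-extracting bit fields
 * from the raw RV32 encoding on every execution. */
typedef struct {
    uint32_t packed; /* opcode | rd << 8 | rs1 << 16 | rs2 << 24 */
} decoded_insn_t;

static decoded_insn_t predecode(uint32_t raw)
{
    decoded_insn_t d;
    d.packed = (raw & 0x7f)                  /* opcode, bits [6:0]  */
             | (((raw >> 7)  & 0x1f) << 8)   /* rd,  bits [11:7]    */
             | (((raw >> 15) & 0x1f) << 16)  /* rs1, bits [19:15]   */
             | (((raw >> 20) & 0x1f) << 24); /* rs2, bits [24:20]   */
    return d;
}

/* Accessors on the hot path reduce to a shift and a byte mask. */
static inline uint32_t insn_opcode(decoded_insn_t d) { return d.packed & 0xff; }
static inline uint32_t insn_rd(decoded_insn_t d)  { return (d.packed >> 8) & 0xff; }
static inline uint32_t insn_rs1(decoded_insn_t d) { return (d.packed >> 16) & 0xff; }
static inline uint32_t insn_rs2(decoded_insn_t d) { return (d.packed >> 24) & 0xff; }
```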

Address translation caching:

  • 1-entry last-VPN cache in front of the 8-set 2-way TLB, skipping set-index hashing and way lookup on same-page accesses (common in tight loops and sequential memory scans).
  • mmu_invalidate_range() flushes the last-VPN entries when the invalidated range covers them, ensuring SBI RFENCE.VMA correctness.
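A minimal sketch of the 1-entry cache and its invalidation rule (names and the stand-in page walk are hypothetical; in semu the slow path is the 8-set 2-way TLB lookup):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK ((1u << PAGE_SHIFT) - 1)

/* Hypothetical 1-entry last-VPN cache: a same-page access returns
 * immediately, skipping set-index hashing and way comparison. */
typedef struct {
    uint32_t last_vpn;
    uint32_t last_ppn;
    bool valid;
} vpn_cache_t;

/* Stand-in for the TLB lookup / page-table walk. */
static uint32_t walk(uint32_t vpn) { return vpn ^ 0x80000u; }

static uint32_t translate(vpn_cache_t *c, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    if (!c->valid || c->last_vpn != vpn) { /* slow path: refill */
        c->last_ppn = walk(vpn);
        c->last_vpn = vpn;
        c->valid = true;
    }
    return (c->last_ppn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
}

/* Mirror of the mmu_invalidate_range() rule: drop the cached entry
 * whenever the invalidated VPN range covers it, so a remote
 * SFENCE.VMA cannot leave a stale translation behind. */
static void invalidate_range(vpn_cache_t *c, uint32_t lo_vpn, uint32_t hi_vpn)
{
    if (c->valid && c->last_vpn >= lo_vpn && c->last_vpn <= hi_vpn)
        c->valid = false;
}
```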

RAM fast path:

  • Per-hart ram_load_last_page/ram_store_last_page cache the host pointer for the most recent physical page, bypassing the mem_load/mem_store function pointer indirection for translated RAM accesses.
  • ram_read_fast/ram_write_fast perform inline load/store with alignment checks, avoiding the full device-dispatch path.
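The load side of this can be sketched as below; the names and the flat RAM model are illustrative, and the alignment checks the real fast path performs are omitted for brevity:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12
#define RAM_PAGES 4

static uint8_t ram[RAM_PAGES << PAGE_SHIFT];

/* Stand-in for the generic dispatch path (mem_load's function-pointer
 * indirection through the device table). */
static uint32_t mem_load_slow(uint32_t paddr)
{
    uint32_t v;
    memcpy(&v, &ram[paddr], sizeof(v));
    return v;
}

/* Hypothetical per-hart page cache: remember the host pointer of the
 * most recently loaded physical page; hits index host memory directly
 * without going through the device-dispatch path. */
typedef struct {
    uint32_t last_page; /* physical page number of the cached page */
    uint8_t *host_base; /* host pointer to that page, NULL when cold */
} ram_page_cache_t;

static uint32_t ram_read_fast(ram_page_cache_t *c, uint32_t paddr)
{
    uint32_t page = paddr >> PAGE_SHIFT;
    if (!c->host_base || c->last_page != page) { /* miss: refill */
        c->host_base = &ram[page << PAGE_SHIFT];
        c->last_page = page;
        return mem_load_slow(paddr);
    }
    uint32_t v; /* hit: direct host access, no dispatch */
    memcpy(&v, c->host_base + (paddr & ((1u << PAGE_SHIFT) - 1)), sizeof(v));
    return v;
}
```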

Execution loop restructuring:

  • semu_run_chunk() passes the full batch size (128 in single-core mode, 8 with slirp networking enabled) to vm_step_many(), amortizing sequential-fetch setup.
  • semu_service_hart_step() wraps per-instruction peripheral ticks for SMP paths where interrupt responsiveness matters.
  • semu_step_chunk() handles exception/ecall/trap recovery across a batch, re-entering vm_step_many() with the remaining budget.
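The re-enter-with-remaining-budget pattern can be sketched like this; the hart model, trap condition, and function names are illustrative stand-ins, not the actual semu interfaces:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal sketch of batched stepping: the inner runner consumes up to
 * `budget` instructions and returns how many it retired; the chunk
 * driver handles the trap and re-enters with the remaining budget. */
typedef struct {
    uint32_t pc;
    bool trapped; /* set when the hart takes an exception/ecall */
} hart_t;

static uint32_t vm_step_many_sketch(hart_t *h, uint32_t budget)
{
    uint32_t done = 0;
    while (done < budget) {
        h->pc += 4; /* stand-in for execute-one-instruction */
        done++;
        if (h->pc == 0x20) { /* pretend an ecall traps at this PC */
            h->trapped = true;
            break;
        }
    }
    return done;
}

static uint32_t step_chunk_sketch(hart_t *h, uint32_t batch)
{
    uint32_t remaining = batch;
    while (remaining) {
        remaining -= vm_step_many_sketch(h, remaining);
        if (h->trapped) {  /* recover, then resume the same batch */
            h->trapped = false;
            h->pc = 0x100; /* stand-in for the trap-handler redirect */
        }
    }
    return batch - remaining;
}
```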

Boot time to login prompt, SMP=1, Linux 6.12 on Threadripper 2990WX:

  • master (7afe818): avg 10838ms (5 runs, range 10776-10884ms)
  • this commit: avg 3517ms (5 runs, range 3492-3560ms)

3.08x faster.

@jserv jserv merged commit 6e5265a into master Apr 23, 2026
10 checks passed
@jserv jserv deleted the performance branch April 23, 2026 03:29