This adds multi-level caching and batched execution to reduce per-instruction overhead in the interpreter loop.

Instruction decode and fetch:
- Pre-decoded instruction struct (decoded_insn_t) packs register indices and opcode fields into a single word, avoiding repeated bit extraction on hot paths (CSR, AMO, SYSTEM opcodes).
- Sequential fetch fast path: track the current I-cache block pointer so consecutive PC+4 fetches read directly from the block without re-entering mmu_fetch.
- vm_step_many() runs up to N instructions in a tight loop, reading sequentially from the I-cache block via a local pointer (seq_ptr) and falling back to mmu_fetch only on block/page boundaries and taken branches.

Address translation caching:
- 1-entry last-VPN cache in front of the 8-set 2-way TLB, skipping set-index hashing and way lookup on same-page accesses (common in tight loops and sequential memory scans).
- mmu_invalidate_range() flushes the last-VPN entries when the invalidated range covers them, ensuring SBI RFENCE.VMA correctness.

RAM fast path:
- Per-hart ram_load_last_page/ram_store_last_page cache the host pointer for the most recent physical page, bypassing the mem_load/mem_store function-pointer indirection for translated RAM accesses.
- ram_read_fast/ram_write_fast perform inline loads/stores with alignment checks, avoiding the full device-dispatch path.

Execution loop restructuring:
- semu_run_chunk() passes the full batch size (128 single-core, 8 with slirp) to vm_step_many(), amortizing sequential-fetch setup.
- semu_service_hart_step() wraps per-instruction peripheral ticks for SMP paths where interrupt responsiveness matters.
- semu_step_chunk() handles exception/ecall/trap recovery across a batch, re-entering vm_step_many() with the remaining budget.

Boot time to login prompt, SMP=1, Linux 6.12 on Threadripper 2990WX:
- master (7afe818): avg 10838ms (5 runs, range 10776-10884ms)
- this commit: avg 3517ms (5 runs, range 3492-3560ms)

3.08x faster.
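The pre-decoded instruction struct described above could look something like the following sketch. The exact field layout, shift constants, and accessor names are assumptions for illustration, not semu's actual definitions; only the name decoded_insn_t comes from the description. The point is that register indices and the opcode are extracted once at decode time, so hot handlers read each field with a single shift+mask.

```c
#include <stdint.h>

/* Hypothetical packed layout (an assumption, not semu's real one):
 * bits 0-6 opcode, 7-11 rd, 12-16 rs1, 17-21 rs2, 22-24 funct3. */
typedef uint32_t decoded_insn_t;

#define DEC_RD_SHIFT     7
#define DEC_RS1_SHIFT    12
#define DEC_RS2_SHIFT    17
#define DEC_FUNCT3_SHIFT 22

/* Extract all fields from the raw RISC-V encoding once, up front. */
static decoded_insn_t decode_once(uint32_t raw)
{
    uint32_t opcode = raw & 0x7f;
    uint32_t rd     = (raw >> 7) & 0x1f;
    uint32_t funct3 = (raw >> 12) & 0x7;
    uint32_t rs1    = (raw >> 15) & 0x1f;
    uint32_t rs2    = (raw >> 20) & 0x1f;
    return opcode | (rd << DEC_RD_SHIFT) | (rs1 << DEC_RS1_SHIFT) |
           (rs2 << DEC_RS2_SHIFT) | (funct3 << DEC_FUNCT3_SHIFT);
}

/* Hot-path accessors: one shift + mask each, no re-decoding. */
static inline uint32_t dec_rd(decoded_insn_t d)  { return (d >> DEC_RD_SHIFT) & 0x1f; }
static inline uint32_t dec_rs1(decoded_insn_t d) { return (d >> DEC_RS1_SHIFT) & 0x1f; }
static inline uint32_t dec_rs2(decoded_insn_t d) { return (d >> DEC_RS2_SHIFT) & 0x1f; }
```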
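A minimal sketch of the 1-entry last-VPN cache sitting in front of the set-associative TLB. The struct fields (last_vpn, last_ppn, last_valid) and the stubbed slow path are illustrative assumptions; only mmu_invalidate_range and the 8-set 2-way TLB shape come from the description above. The slow path here is a stand-in identity mapping so the sketch is self-contained.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t last_vpn;  /* virtual page number of the most recent hit */
    uint32_t last_ppn;  /* its cached translation */
    bool     last_valid;
    /* ... the real 8-set, 2-way TLB arrays would live here ... */
} mmu_t;

/* Stubbed slow path: in the real MMU this does set-index hashing, way
 * comparison, and a page walk on miss. Identity-maps for illustration. */
static bool tlb_lookup_slow(mmu_t *m, uint32_t vpn, uint32_t *ppn)
{
    *ppn = vpn;
    m->last_vpn = vpn;      /* refill the 1-entry cache on the way out */
    m->last_ppn = *ppn;
    m->last_valid = true;
    return true;
}

static bool translate(mmu_t *m, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> 12;
    if (m->last_valid && m->last_vpn == vpn) {
        /* Same-page access: skip set hashing and way lookup entirely. */
        *paddr = (m->last_ppn << 12) | (vaddr & 0xfff);
        return true;
    }
    uint32_t ppn;
    if (!tlb_lookup_slow(m, vpn, &ppn))
        return false;
    *paddr = (ppn << 12) | (vaddr & 0xfff);
    return true;
}

/* Invalidation must also cover the front cache, or a stale translation
 * would survive RFENCE.VMA. */
static void invalidate_range(mmu_t *m, uint32_t start, uint32_t end)
{
    if (m->last_valid && m->last_vpn >= (start >> 12) &&
        m->last_vpn <= (end >> 12))
        m->last_valid = false;
}
```

The correctness-critical design point is the last line: any structure that caches translations, however small, must be flushed on the same events as the TLB itself.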
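The per-hart last-page pointer cache might be sketched as below. The hart_t layout and the stubbed slow path are assumptions for illustration (the real mem_load path dispatches through function pointers to RAM or device handlers); ram_read_fast and the load-side page cache follow the names in the description.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

typedef struct {
    uint8_t *ram;            /* host backing store for guest RAM */
    uint32_t ram_size;
    uint32_t load_last_ppn;  /* physical page number of the last load */
    uint8_t *load_last_ptr;  /* host pointer to that page, or NULL */
} hart_t;

/* Stubbed slow path: the real one dispatches through mem_load/mem_store
 * function pointers and device handlers. Refills the 1-page cache. */
static uint8_t *mem_page_slow(hart_t *h, uint32_t paddr)
{
    if (paddr >= h->ram_size)
        return NULL;                     /* MMIO/device range: not cached */
    h->load_last_ppn = paddr >> 12;
    h->load_last_ptr = h->ram + (h->load_last_ppn << 12);
    return h->load_last_ptr;
}

/* Fast path: same physical page as the previous load -> direct host
 * access, no function-pointer indirection, no device dispatch. */
static int ram_read_fast(hart_t *h, uint32_t paddr, uint32_t *out)
{
    if (paddr & 3)                       /* inline alignment check */
        return -1;
    uint8_t *page = (h->load_last_ptr && (paddr >> 12) == h->load_last_ppn)
                        ? h->load_last_ptr
                        : mem_page_slow(h, paddr);
    if (!page)
        return -1;
    memcpy(out, page + (paddr & 0xfff), 4);
    return 0;
}
```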
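The batch/recovery structure can be sketched as follows. This is a simplified model, not semu's actual code: the inner function here just counts steps and stops at a page boundary, standing in for vm_step_many() returning early on a boundary, taken branch, or trap, while the outer loop re-enters with the remaining budget as described for semu_step_chunk().

```c
#include <stdint.h>

typedef struct {
    uint64_t pc;
    uint64_t insn_count;
} vm_t;

/* Run up to `budget` instructions; return how many actually ran.
 * Returning fewer than `budget` models hitting a page boundary,
 * taken branch, or trap that forces re-entry through the slow path. */
static int step_many(vm_t *vm, int budget)
{
    int done = 0;
    while (done < budget) {
        vm->pc += 4;                 /* sequential PC+4 fast path */
        vm->insn_count++;
        done++;
        if ((vm->pc & 0xfff) == 0)   /* crossed a page boundary */
            break;
    }
    return done;
}

/* Drive a whole batch, amortizing fetch setup; on early exit, handle
 * the trap (elided here) and resume with the remaining budget. */
static void step_chunk(vm_t *vm, int batch)
{
    int remaining = batch;
    while (remaining > 0) {
        int ran = step_many(vm, remaining);
        /* real code would service exceptions/ecalls here */
        remaining -= ran;
    }
}
```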