Introduce execution fast paths#124

Merged
jserv merged 1 commit into master from performance
Apr 23, 2026
Conversation

@jserv (Collaborator) commented Apr 23, 2026

This adds multi-level caching and batched execution to reduce per-instruction overhead in the interpreter loop:

Instruction decode and fetch:

  • Pre-decoded instruction struct (decoded_insn_t) packs register indices and opcode fields into a single word, avoiding repeated bit extraction on hot paths (CSR, AMO, SYSTEM opcodes).
  • Sequential fetch fast-path: track the current I-cache block pointer so consecutive PC+4 fetches read directly from the block without re-entering mmu_fetch.
  • vm_step_many() runs up to N instructions in a tight loop, reading sequentially from the I-cache block via a local pointer (seq_ptr) and falling back to mmu_fetch only on block/page boundaries and taken branches.
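The pre-decoded form can be sketched roughly as follows; the field layout and helper names here are illustrative only, not the actual semu structures, but they show the idea of paying the bit extraction once at decode time:

```c
#include <stdint.h>

/* Illustrative decoded_insn_t: opcode and register indices are
 * extracted once and packed into a single 32-bit word, so hot
 * handlers read plain bytes instead of re-extracting bit fields
 * from the raw RV32 encoding on every execution. */
typedef struct {
    uint32_t packed; /* opcode | rd << 8 | rs1 << 16 | rs2 << 24 */
} decoded_insn_t;

static decoded_insn_t predecode(uint32_t raw)
{
    decoded_insn_t d;
    d.packed = (raw & 0x7f)                  /* opcode, bits [6:0]  */
             | (((raw >> 7)  & 0x1f) << 8)   /* rd,  bits [11:7]    */
             | (((raw >> 15) & 0x1f) << 16)  /* rs1, bits [19:15]   */
             | (((raw >> 20) & 0x1f) << 24); /* rs2, bits [24:20]   */
    return d;
}

/* Accessors on the hot path reduce to a shift and a byte mask. */
static inline uint32_t insn_opcode(decoded_insn_t d) { return d.packed & 0xff; }
static inline uint32_t insn_rd(decoded_insn_t d)  { return (d.packed >> 8) & 0xff; }
static inline uint32_t insn_rs1(decoded_insn_t d) { return (d.packed >> 16) & 0xff; }
static inline uint32_t insn_rs2(decoded_insn_t d) { return (d.packed >> 24) & 0xff; }
```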

Address translation caching:

  • 1-entry last-VPN cache in front of the 8-set 2-way TLB, skipping set-index hashing and way lookup on same-page accesses (common in tight loops and sequential memory scans).
  • mmu_invalidate_range() flushes the last-VPN entries when the invalidated range covers them, ensuring SBI RFENCE.VMA correctness.
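A minimal sketch of the 1-entry cache and its invalidation rule (names and the stand-in page walk are hypothetical; in semu the slow path is the 8-set 2-way TLB lookup):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK ((1u << PAGE_SHIFT) - 1)

/* Hypothetical 1-entry last-VPN cache: a same-page access returns
 * immediately, skipping set-index hashing and way comparison. */
typedef struct {
    uint32_t last_vpn;
    uint32_t last_ppn;
    bool valid;
} vpn_cache_t;

/* Stand-in for the TLB lookup / page-table walk. */
static uint32_t walk(uint32_t vpn) { return vpn ^ 0x80000u; }

static uint32_t translate(vpn_cache_t *c, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    if (!c->valid || c->last_vpn != vpn) { /* slow path: refill */
        c->last_ppn = walk(vpn);
        c->last_vpn = vpn;
        c->valid = true;
    }
    return (c->last_ppn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
}

/* Mirror of the mmu_invalidate_range() rule: drop the cached entry
 * whenever the invalidated VPN range covers it, so a remote
 * SFENCE.VMA cannot leave a stale translation behind. */
static void invalidate_range(vpn_cache_t *c, uint32_t lo_vpn, uint32_t hi_vpn)
{
    if (c->valid && c->last_vpn >= lo_vpn && c->last_vpn <= hi_vpn)
        c->valid = false;
}
```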

RAM fast path:

  • Per-hart ram_load_last_page/ram_store_last_page cache the host pointer for the most recent physical page, bypassing the mem_load/mem_store function pointer indirection for translated RAM accesses.
  • ram_read_fast/ram_write_fast perform inline load/store with alignment checks, avoiding the full device-dispatch path.
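The load side of this can be sketched as below; the names and the flat RAM model are illustrative, and the alignment checks the real fast path performs are omitted for brevity:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12
#define RAM_PAGES 4

static uint8_t ram[RAM_PAGES << PAGE_SHIFT];

/* Stand-in for the generic dispatch path (mem_load's function-pointer
 * indirection through the device table). */
static uint32_t mem_load_slow(uint32_t paddr)
{
    uint32_t v;
    memcpy(&v, &ram[paddr], sizeof(v));
    return v;
}

/* Hypothetical per-hart page cache: remember the host pointer of the
 * most recently loaded physical page; hits index host memory directly
 * without going through the device-dispatch path. */
typedef struct {
    uint32_t last_page; /* physical page number of the cached page */
    uint8_t *host_base; /* host pointer to that page, NULL when cold */
} ram_page_cache_t;

static uint32_t ram_read_fast(ram_page_cache_t *c, uint32_t paddr)
{
    uint32_t page = paddr >> PAGE_SHIFT;
    if (!c->host_base || c->last_page != page) { /* miss: refill */
        c->host_base = &ram[page << PAGE_SHIFT];
        c->last_page = page;
        return mem_load_slow(paddr);
    }
    uint32_t v; /* hit: direct host access, no dispatch */
    memcpy(&v, c->host_base + (paddr & ((1u << PAGE_SHIFT) - 1)), sizeof(v));
    return v;
}
```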

Execution loop restructuring:

  • semu_run_chunk() passes the full batch size (128 in single-core mode, 8 with slirp networking enabled) to vm_step_many(), amortizing sequential-fetch setup.
  • semu_service_hart_step() wraps per-instruction peripheral ticks for SMP paths where interrupt responsiveness matters.
  • semu_step_chunk() handles exception/ecall/trap recovery across a batch, re-entering vm_step_many() with the remaining budget.
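The re-enter-with-remaining-budget pattern can be sketched like this; the hart model, trap condition, and function names are illustrative stand-ins, not the actual semu interfaces:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal sketch of batched stepping: the inner runner consumes up to
 * `budget` instructions and returns how many it retired; the chunk
 * driver handles the trap and re-enters with the remaining budget. */
typedef struct {
    uint32_t pc;
    bool trapped; /* set when the hart takes an exception/ecall */
} hart_t;

static uint32_t vm_step_many_sketch(hart_t *h, uint32_t budget)
{
    uint32_t done = 0;
    while (done < budget) {
        h->pc += 4; /* stand-in for execute-one-instruction */
        done++;
        if (h->pc == 0x20) { /* pretend an ecall traps at this PC */
            h->trapped = true;
            break;
        }
    }
    return done;
}

static uint32_t step_chunk_sketch(hart_t *h, uint32_t batch)
{
    uint32_t remaining = batch;
    while (remaining) {
        remaining -= vm_step_many_sketch(h, remaining);
        if (h->trapped) {  /* recover, then resume the same batch */
            h->trapped = false;
            h->pc = 0x100; /* stand-in for the trap-handler redirect */
        }
    }
    return batch - remaining;
}
```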

Boot time to login prompt, SMP=1, Linux 6.12 on Threadripper 2990WX:

  • master (7afe818): avg 10838ms (5 runs, range 10776-10884ms)
  • this commit: avg 3517ms (5 runs, range 3492-3560ms)

3.08x faster.

@jserv jserv merged commit 6e5265a into master Apr 23, 2026
10 checks passed
@jserv jserv deleted the performance branch April 23, 2026 03:29