Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
82 changes: 82 additions & 0 deletions docs/planning/ARM64_CPU0_SMP_INVESTIGATION.md
Original file line number Diff line number Diff line change
Expand Up @@ -3239,3 +3239,85 @@ AHCI Port1 `IS=0x1` and no completion for a later command after the local wake
path has already switched successfully. The next audit should focus on AHCI
interrupt completion / EOI-deactivation ordering or command publication, not on
the local IRQ-return wake-buffer drain.

## 2026-04-16 - F18 AHCI CI-level completion loop

F18 targeted the AHCI completion race isolated by F17: timeout dumps showed
`CI=0x0` and `AHCI_PORT1_IS=0x1` while the waiter was blocked on a command
newer than the last command completed by the ISR. That means hardware had
completed the command, but the software completion path missed the waiter wake.

### Audit summary

The F17 Breenix AHCI handler was edge-sensitive. It read `PORT_IS`, cleared
`PORT_IS`/`HBA_IS`, read `PORT_CI` once, and completed at most one slot from
that single sampled interrupt state. That can miss a completion once the
hardware `PORT_CI` bit has already dropped but no new software completion pass
runs for the active slot.

Linux v6.8 is level-sensitive. In `/tmp/linux-v6.8/drivers/ata/libahci.c`,
`ahci_port_intr()` reads and acknowledges `PORT_IRQ_STAT` at lines 1963-1964,
then calls `ahci_handle_port_interrupt()` at line 1966. The command completion
path runs through `ahci_qc_complete()`; it reads `PORT_SCR_ACT` and
`PORT_CMD_ISSUE` into `qc_active` at lines 1875-1885, then calls
`ata_qc_complete_multiple(ap, qc_active)` at line 1888. The important parity
point is deriving completions from currently active hardware command bits, not
from a single interrupt edge.

### Fix description

`kernel/src/drivers/ahci/mod.rs` now bounds the AHCI completion drain loop at
eight iterations. For each active port, it:

- reads the per-port active software mask and `PORT_CI`;
- computes completed slots as `PORT_ACTIVE_MASK & !PORT_CI`;
- clears completed active bits atomically;
- acknowledges sampled `PORT_IS`;
- re-reads `PORT_IS` and `PORT_CI`, looping if the port reasserted or another
active slot is now clear.

Because the current driver issues only slot 0, the slot-0 completion token is
recorded during the CI loop and the actual wake is published only after the
port interrupt is stable. This prevents the woken waiter from issuing the next
slot-0 command while the prior AHCI interrupt line remains asserted. The prior
single-active-slot fallback for controllers that signal completion before
`PORT_CI` is observed clear was preserved.

F18 also added AHCI ring site `CI_LOOP`; the loop iteration count is emitted
in the event `token` field. Passing validation runs did not trigger
timeout-time AHCI ring dumps, so visible `ahci_ci_loop_iterations` is `0` in
the final serial logs.

### Validation sweep

Artifacts:

```text
logs/breenix-parallels-cpu0/f18-ahci-ci-loop/run{1..5}/
logs/breenix-parallels-cpu0/f18-ahci-ci-loop/summary.txt
logs/breenix-parallels-cpu0/f18-ahci-ci-loop/exit.md
```

Each `./run.sh --parallels --test 60` invocation exited 1 because the
Parallels screenshot helper could not find the generated VM window. As in
prior F-series sweeps, the serial log is the validation source.

| Run | Reached bsshd | AHCI timeout | Corruption markers | Failed exec | Soft lockup | `ahci_ci_loop_iterations` | Result |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| run1 | 1 | 0 | 0 | 0 | 0 | 0 | PASS |
| run2 | 1 | 0 | 0 | 0 | 0 | 0 | PASS |
| run3 | 1 | 0 | 0 | 0 | 0 | 0 | PASS |
| run4 | 1 | 0 | 0 | 0 | 0 | 0 | PASS |
| run5 | 1 | 0 | 0 | 0 | 0 | 0 | PASS |

### Verdict

Verdict: **PASS**. F18 reached the target 5/5 Parallels sweep: every final run
reached `[init] bsshd started (PID 2)` with `ahci_timeouts=0` and
`corruption_markers=0`.

The ARM64 CPU0/AHCI timeout investigation is therefore complete for this
failure signature. Recommended next step: open a cleanup PR to remove the
F8-F17 diagnostic scaffolding and keep only the minimal AHCI CI-level
completion behavior plus any low-cost regression signal needed for future
AHCI timeout triage.
229 changes: 149 additions & 80 deletions kernel/src/drivers/ahci/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -291,6 +291,7 @@ pub(crate) const AHCI_TRACE_RESCHED_CHECK_ENTRY: u32 = 32;
pub(crate) const AHCI_TRACE_RESCHED_CHECK_DRAINED_WAKE: u32 = 33;
pub(crate) const AHCI_TRACE_RESCHED_CHECK_SWITCHED: u32 = 34;
pub(crate) const AHCI_TRACE_RESCHED_CHECK_RETURN: u32 = 35;
pub(crate) const AHCI_TRACE_CI_LOOP: u32 = 36;

struct AhciTraceSlot {
site: AtomicU32,
Expand Down Expand Up @@ -461,6 +462,7 @@ fn ahci_trace_site_name(site: u32) -> &'static str {
AHCI_TRACE_RESCHED_CHECK_DRAINED_WAKE => "RESCHED_CHECK_DRAINED_WAKE",
AHCI_TRACE_RESCHED_CHECK_SWITCHED => "RESCHED_CHECK_SWITCHED",
AHCI_TRACE_RESCHED_CHECK_RETURN => "RESCHED_CHECK_RETURN",
AHCI_TRACE_CI_LOOP => "CI_LOOP",
_ => "UNKNOWN",
}
}
Expand Down Expand Up @@ -2625,13 +2627,17 @@ fn probe_platform_irq(ctrl: &AhciController) {

/// AHCI MSI interrupt handler — called from the IRQ dispatch in `exception.rs`.
///
/// Reads HBA_IS to identify which port(s) fired, reads and clears PORT_IS,
/// then sets the per-port `AHCI_PORT_COMPLETE` flag so `issue_cmd_slot0` can
/// wake up. Clears HBA_IS last (AHCI spec requires PORT_IS cleared first).
/// Reads HBA_IS to identify which port(s) fired, acknowledges the sampled
/// PORT_IS, then derives completions from the per-port active-slot mask and
/// PORT_CI until the port is stable.
///
/// This function must be lock-free and allocation-free (called from IRQ context).
const AHCI_CI_COMPLETION_LOOP_LIMIT: u32 = 8;
const AHCI_TRACKED_SLOT_MASK: u32 = (1u32 << AHCI_MAX_CONCURRENT) - 1;

#[inline]
fn detect_completed_slots(active: u32, ci_after: u32, port_is: u32) -> u32 {
let active = active & AHCI_TRACKED_SLOT_MASK;
let mut completed = active & !ci_after;

if completed == 0
Expand Down Expand Up @@ -2715,11 +2721,12 @@ pub fn handle_interrupt() {
if !check_all && (hba_is & (1 << port)) == 0 {
continue;
}
let is = port_read(abar, port, PORT_IS);
if is == 0 {
let mut is = port_read(abar, port, PORT_IS);
let ci_entry = port_read(abar, port, PORT_CI);
let active_entry = PORT_ACTIVE_MASK[port].load(Ordering::Acquire) & AHCI_TRACKED_SLOT_MASK;
if is == 0 && (active_entry & !ci_entry) == 0 {
continue;
}
let ci_entry = port_read(abar, port, PORT_CI);
let sact_entry = port_read(abar, port, PORT_SACT);
let serr_entry = port_read(abar, port, PORT_SERR);
push_ahci_event(
Expand All @@ -2730,100 +2737,162 @@ pub fn handle_interrupt() {
sact_entry,
serr_entry,
AHCI_COMPLETIONS[port][0].waiter_tid(),
0,
active_entry,
PORT_ACTIVE_CMD_NUM[port].load(Ordering::Acquire),
false,
);
AHCI_ISR_PORT_HIT[port].fetch_add(1, Ordering::Relaxed);
if is != 0 {
AHCI_ISR_PORT_HIT[port].fetch_add(1, Ordering::Relaxed);
}
if port == 1 {
AHCI_LAST_ISR_PORT1_IS.store(is, Ordering::Relaxed);
}
ack_port_interrupt(abar, port, is);

let is_after_clear = port_read(abar, port, PORT_IS);
let ci_after_clear = port_read(abar, port, PORT_CI);
let sact_after_clear = port_read(abar, port, PORT_SACT);
let active_mask = PORT_ACTIVE_MASK[port].load(Ordering::Acquire);
let completed_slots = detect_completed_slots(active_mask, ci_after_clear, is);
let mut waiter_tid = 0u64;
let mut cmd_num = 0u32;
let mut wake_success = false;
let mut slots_processed = 0u32;
let mut loop_iterations = 0u32;
// Slot 0 is the only issued slot today. Defer the wake until after
// PORT_IS has been acknowledged and the CI drain loop is stable, so the
// woken thread cannot issue the next command while this interrupt line
// is still asserted.
let mut pending_cmd_num = 0u32;
let mut pending_waiter_tid = 0u64;
let mut pending_slot_bit = 0u32;
let mut pending_is = 0u32;
let mut pending_ci = 0u32;
let mut pending_sact = 0u32;
let mut pending_serr = 0u32;
let mut is_after_clear: u32;
let mut ci_after_clear: u32;
let mut sact_after_clear: u32;
let mut serr_after_clear: u32;

loop {
loop_iterations += 1;

let sampled_is = is;
if sampled_is != 0 {
ack_port_interrupt(abar, port, sampled_is);
}

let ci = port_read(abar, port, PORT_CI);
let sact = port_read(abar, port, PORT_SACT);
let serr = port_read(abar, port, PORT_SERR);
let active_mask =
PORT_ACTIVE_MASK[port].load(Ordering::Acquire) & AHCI_TRACKED_SLOT_MASK;
let completed_slots = detect_completed_slots(active_mask, ci, sampled_is);

push_ahci_event(
AHCI_TRACE_CI_LOOP,
port as u32,
sampled_is,
ci,
sact,
serr,
AHCI_COMPLETIONS[port][0].waiter_tid(),
completed_slots,
loop_iterations,
completed_slots != 0,
);

let mut remaining = completed_slots;
while remaining != 0 {
let slot = remaining.trailing_zeros() as usize;
let slot_bit = 1u32 << slot;
remaining &= !slot_bit;

if (is & (PORT_IRQ_COMPLETE | PORT_IRQ_ERROR)) != 0 {
if port < MAX_AHCI_PORTS {
// Atomically load-and-clear the active command number.
// If cmd_num > 0, this is the first ISR invocation for this
// command — signal completion. If cmd_num == 0, a previous
// ISR already completed this command (level-triggered
// re-delivery or spurious). Calling complete(0) would
// overwrite done with 0, destroying the real token and
// causing the waiter to time out.
cmd_num = PORT_ACTIVE_CMD_NUM[port].swap(0, Ordering::AcqRel);
if port == 1 {
AHCI_LAST_ISR_PORT1_CMD_NUM.store(cmd_num, Ordering::Relaxed);
let was_active =
(PORT_ACTIVE_MASK[port].fetch_and(!slot_bit, Ordering::AcqRel) & slot_bit) != 0;
if !was_active {
continue;
}
waiter_tid = AHCI_COMPLETIONS[port][0].waiter_tid();
push_ahci_event(
AHCI_TRACE_POST_CLEAR,
port as u32,
is_after_clear,
ci_after_clear,
sact_after_clear,
serr_entry,
waiter_tid,
completed_slots,
cmd_num,
false,
);
if cmd_num != 0 {
AHCI_LAST_ISR_COMPLETE_PORT.store(port as u32, Ordering::Relaxed);
AHCI_LAST_ISR_COMPLETE_CMD.store(cmd_num, Ordering::Relaxed);
AHCI_LAST_ISR_COMPLETE_WAITER.store(waiter_tid, Ordering::Relaxed);
push_ahci_event(
AHCI_TRACE_BEFORE_COMPLETE,
port as u32,
is_after_clear,
ci_after_clear,
sact_after_clear,
serr_entry,
waiter_tid,
completed_slots,
cmd_num,
false,
);
AHCI_COMPLETIONS[port][0].complete(cmd_num);
push_ahci_event(
AHCI_TRACE_AFTER_COMPLETE,
port as u32,
is_after_clear,
ci_after_clear,
sact_after_clear,
serr_entry,
waiter_tid,
completed_slots,
cmd_num,
false,
);
AHCI_ISR_COMPLETE_HIT[port].fetch_add(1, Ordering::Relaxed);
wake_success = waiter_tid != 0;
slots_processed = completed_slots.count_ones();

slots_processed += 1;

// The current driver issues only slot 0. The arrays are sized
// for future multi-slot work, but only slot 0 has a command
// token to publish back to wait_cmd_slot0().
if slot == 0 {
let cmd_num = PORT_ACTIVE_CMD_NUM[port].swap(0, Ordering::AcqRel);
if port == 1 {
AHCI_LAST_ISR_PORT1_CMD_NUM.store(cmd_num, Ordering::Relaxed);
}
waiter_tid = AHCI_COMPLETIONS[port][slot].waiter_tid();
if cmd_num != 0 {
pending_cmd_num = cmd_num;
pending_waiter_tid = waiter_tid;
pending_slot_bit = slot_bit;
pending_is = sampled_is;
pending_ci = ci;
pending_sact = sact;
pending_serr = serr;
}
}
PORT_ACTIVE_MASK[port].store(0, Ordering::Release);
}
} else {

is_after_clear = port_read(abar, port, PORT_IS);
ci_after_clear = port_read(abar, port, PORT_CI);
sact_after_clear = port_read(abar, port, PORT_SACT);
serr_after_clear = port_read(abar, port, PORT_SERR);
let active_after =
PORT_ACTIVE_MASK[port].load(Ordering::Acquire) & AHCI_TRACKED_SLOT_MASK;
let completed_after_clear = active_after & !ci_after_clear;

push_ahci_event(
AHCI_TRACE_POST_CLEAR,
port as u32,
is_after_clear,
ci_after_clear,
sact_after_clear,
serr_entry,
AHCI_COMPLETIONS[port][0].waiter_tid(),
completed_slots,
0,
serr_after_clear,
waiter_tid,
completed_after_clear,
loop_iterations,
completed_after_clear != 0,
);

if loop_iterations >= AHCI_CI_COMPLETION_LOOP_LIMIT {
break;
}
if is_after_clear == 0 && completed_after_clear == 0 {
break;
}
is = is_after_clear;
}

if pending_cmd_num != 0 {
AHCI_LAST_ISR_COMPLETE_PORT.store(port as u32, Ordering::Relaxed);
AHCI_LAST_ISR_COMPLETE_CMD.store(pending_cmd_num, Ordering::Relaxed);
AHCI_LAST_ISR_COMPLETE_WAITER.store(pending_waiter_tid, Ordering::Relaxed);
push_ahci_event(
AHCI_TRACE_BEFORE_COMPLETE,
port as u32,
pending_is,
pending_ci,
pending_sact,
pending_serr,
pending_waiter_tid,
pending_slot_bit,
pending_cmd_num,
false,
);
AHCI_COMPLETIONS[port][0].complete(pending_cmd_num);
push_ahci_event(
AHCI_TRACE_AFTER_COMPLETE,
port as u32,
pending_is,
pending_ci,
pending_sact,
pending_serr,
pending_waiter_tid,
pending_slot_bit,
pending_cmd_num,
false,
);
AHCI_ISR_COMPLETE_HIT[port].fetch_add(1, Ordering::Relaxed);
waiter_tid = pending_waiter_tid;
wake_success |= pending_waiter_tid != 0;
}

push_ahci_event(
Expand All @@ -2832,10 +2901,10 @@ pub fn handle_interrupt() {
is_after_clear,
ci_after_clear,
sact_after_clear,
serr_entry,
serr_after_clear,
waiter_tid,
slots_processed,
cmd_num,
loop_iterations,
wake_success,
);
}
Expand Down
Loading