
Bug in CTC forced alignment: softmax used instead of log_softmax #2755

@Shawn-L23

Description


There is a bug in the CTC forced alignment implementation of the SenseVoice model: softmax is used where log_softmax is required, which corrupts the alignment scores and produces incorrect alignment results.

To Reproduce

The bug can be confirmed by inspecting the code at the location given under "Code sample" below.
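
As a standalone illustration (a minimal sketch with toy shapes and values, not code from the repository), the two functions produce values on entirely different scales, and only the log values can legitimately be summed by an alignment DP:

import torch

torch.manual_seed(0)
logits = torch.randn(6, 4)  # 6 frames, 4 tokens (toy values)

probs = logits.softmax(-1)          # values in (0, 1)
log_probs = logits.log_softmax(-1)  # values in (-inf, 0]

# A CTC path score is a product of per-frame probabilities, i.e. a sum
# of per-frame log probabilities. The alignment DP adds its inputs, so
# summing raw probabilities computes neither quantity.
path = torch.tensor([0, 1, 1, 2, 3, 3])
frames = torch.arange(6)
print(probs[frames, path].sum())        # what the buggy code effectively feeds the DP
print(log_probs[frames, path].sum())    # correct log path score
print(probs[frames, path].log().sum())  # identical to the previous line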

Code sample

branch: main @252eef8b8b29b603d10bc640bc4f0c3fe12c3604
Location: funasr/models/sense_voice/model.py, line 933

Current (incorrect) code:

logits_speech = self.ctc.softmax(encoder_out)[i, 4 : encoder_out_lens[i].item(), :]
pred = logits_speech.argmax(-1).cpu()
logits_speech[pred == self.blank_id, self.blank_id] = 0
align = ctc_forced_align(
    logits_speech.unsqueeze(0).float(),
    torch.Tensor(token_ids).unsqueeze(0).long().to(logits_speech.device),
    (encoder_out_lens[i] - 4).long(),
    torch.tensor(len(token_ids)).unsqueeze(0).long().to(logits_speech.device),
    ignore_id=self.ignore_id,
)

The issue: The ctc_forced_align function expects log probabilities, not regular probabilities.

Evidence from funasr/models/sense_voice/utils/ctc_alignment.py:

  • Line 3: Parameter is named log_probs
  • Line 12: Docstring states: "log_probs (Tensor): log probability of CTC emission output."
  • Line 53: Uses log-space arithmetic (see the sketch after this list):
best_score[:, padding_num:] = log_probs[:, t].gather(-1, _t_a_r_g_e_t_s_) + prev_max_value
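
To make the evidence on line 53 concrete, here is a simplified sketch of a log-space Viterbi recursion (blanks and the full CTC transition rules are omitted; this is not the funasr implementation). Adding log_probs frame by frame plays the role that multiplying probabilities would play, so the recursion is only meaningful if its input is already in log space:

import torch

torch.manual_seed(0)
T, V, L = 8, 5, 3                      # frames, vocab size, target length
log_probs = torch.randn(T, V).log_softmax(-1)
targets = torch.tensor([1, 3, 2])      # toy target token ids

# Max-sum (Viterbi) recursion, simplified: best[l] is the best log score
# of having consumed the first l+1 targets so far. Adding log_probs here
# multiplies probabilities; on softmax outputs the sum would be meaningless.
best = torch.full((L,), float("-inf"))
best[0] = log_probs[0, targets[0]]
for t in range(1, T):
    stay = best                                                  # emit same token again
    advance = torch.cat([best.new_full((1,), float("-inf")), best[:-1]])
    best = torch.maximum(stay, advance) + log_probs[t, targets]
print(best[-1])  # best log score of a monotone alignment covering all targets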

Expected behavior

The code should use log_softmax instead of softmax:

logits_speech = self.ctc.log_softmax(encoder_out)[i, 4 : encoder_out_lens[i].item(), :]

This will provide log probabilities (range: -∞ to 0) as expected by the ctc_forced_align function, instead of regular probabilities (range: 0 to 1).
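
A quick sanity check (an illustration, not repository code): because log is monotonic, the argmax used for blank masking just before the alignment call selects exactly the same frames before and after the fix; what changes is the scale of the values handed to the DP:

import torch

torch.manual_seed(0)
logits = torch.randn(10, 6)

# log is monotonic, so argmax is identical either way: the greedy
# prediction used for blank masking does not depend on the fix.
assert torch.equal(logits.softmax(-1).argmax(-1),
                   logits.log_softmax(-1).argmax(-1))

# What changes are the values: probabilities in (0, 1) vs.
# log probabilities in (-inf, 0], which is what the DP expects.
print(logits.softmax(-1).max(), logits.log_softmax(-1).max())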

Environment

Not relevant: the bug follows directly from the code and is independent of the environment.
