fix(model_runner): correct seqlen_k to chunk boundary in prepare_prefill

During chunked prefill, seqlen_k was set to len(seq) (the full sequence
length), causing the attention kernel to access uninitialized KV slots
for tokens not yet scheduled in the current chunk.

Fix: compute end = start + seqlen_q first, then set seqlen_k = end, so
attention covers only keys up to the end of the current chunk.
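
A minimal sketch of the corrected bookkeeping, for illustration only. The
helper name chunk_bounds and its plain-integer arguments are hypothetical
stand-ins for the seq fields used in prepare_prefill; the arithmetic mirrors
the lines in the diff.

```python
def chunk_bounds(seq_len, num_cached_tokens, num_scheduled_tokens):
    """Return (start, end, seqlen_q, seqlen_k) for one prefill chunk.

    Hypothetical helper: the arguments stand in for len(seq),
    seq.num_cached_tokens and seq.num_scheduled_tokens.
    """
    start = min(num_cached_tokens, seq_len - 1)
    seqlen_q = num_scheduled_tokens
    end = start + seqlen_q
    # Keys span the cached prefix plus the current chunk, nothing beyond it.
    seqlen_k = end
    return start, end, seqlen_q, seqlen_k


# Assumed scenario: a 10-token prompt prefilled in chunks of 4, 4 and 2.
first = chunk_bounds(10, 0, 4)
second = chunk_bounds(10, 4, 4)
last = chunk_bounds(10, 8, 2)
```

Under these assumptions, seqlen_k reaches the full sequence length only on
the final chunk; before the fix it was 10 for every chunk.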

Fixes #212
Tai An
2026-04-22 15:13:19 -07:00
parent 812eb1c1e4
commit 25794a1f29
+1 -1
@@ -139,8 +139,8 @@ class ModelRunner:
 seqlen = len(seq)
 start = min(seq.num_cached_tokens, seqlen - 1)
 seqlen_q = seq.num_scheduled_tokens
-seqlen_k = seqlen
 end = start + seqlen_q
+seqlen_k = end
 input_ids.extend(seq[start:end])
 positions.extend(range(start, end))
 cu_seqlens_q.append(cu_seqlens_q[-1] + seqlen_q)