fix(model_runner): correct seqlen_k to chunk boundary in prepare_prefill
During chunked prefill, seqlen_k was set to len(seq), the full sequence length. This let the attention kernel read uninitialized KV slots belonging to tokens not yet scheduled in the current chunk.

Fix: compute end = start + seqlen_q first, then set seqlen_k = end, so attention stops at the current chunk boundary.

Fixes #212
@@ -139,8 +139,8 @@ class ModelRunner:
 seqlen = len(seq)
 start = min(seq.num_cached_tokens, seqlen - 1)
 seqlen_q = seq.num_scheduled_tokens
-seqlen_k = seqlen
 end = start + seqlen_q
+seqlen_k = end
 input_ids.extend(seq[start:end])
 positions.extend(range(start, end))
 cu_seqlens_q.append(cu_seqlens_q[-1] + seqlen_q)
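The effect of the change can be checked in isolation with a minimal sketch. ToySeq and prefill_window are hypothetical names introduced only for this illustration; they mirror the fields used in the hunk (num_cached_tokens, num_scheduled_tokens, len(seq)) and are not the project's actual classes. The example assumes a 10-token sequence with 4 tokens already cached and 3 scheduled in the current chunk.

```python
from dataclasses import dataclass


@dataclass
class ToySeq:
    """Hypothetical stand-in for the scheduler's sequence object."""
    token_ids: list
    num_cached_tokens: int      # tokens whose KV entries are already written
    num_scheduled_tokens: int   # tokens to process in the current chunk

    def __len__(self):
        return len(self.token_ids)


def prefill_window(seq):
    """Return (start, end, seqlen_q, seqlen_k) for one prefill chunk."""
    seqlen = len(seq)
    start = min(seq.num_cached_tokens, seqlen - 1)
    seqlen_q = seq.num_scheduled_tokens
    end = start + seqlen_q
    seqlen_k = end  # keys stop at the chunk boundary, not at len(seq)
    return start, end, seqlen_q, seqlen_k


# Assumed example: 10 tokens total, 4 cached, 3 scheduled in this chunk.
seq = ToySeq(token_ids=list(range(10)), num_cached_tokens=4, num_scheduled_tokens=3)
start, end, seqlen_q, seqlen_k = prefill_window(seq)

# Slots that the old value (seqlen_k = len(seq)) would have exposed to the
# kernel even though this chunk has not written them yet.
stale_slots = list(range(seqlen_k, len(seq)))

print(start, end, seqlen_q, seqlen_k, stale_slots)  # 4 7 3 7 [7, 8, 9]
```

When the whole prompt is scheduled in one chunk with nothing cached, end equals len(seq), so the old and new values of seqlen_k coincide and only chunked prefill is affected.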