Annotated 16 source files covering the full architecture:
engine (scheduler, block manager, model runner), layers (attention,
linear, sampler, etc.), model (qwen3), and utils.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The scheduler overestimated num_scheduled_tokens because it read a stale num_cached_tokens before block_manager.allocate(seq) had updated it with prefix cache hits. In prepare_prefill (model_runner.py), 'end = start + seqlen_q' then exceeded the sequence length and inflated 'end_block'. As a result, accessing seq.block_table[i] at line 155 failed with an 'index out of range' error, because i ran past the blocks physically allocated to the sequence.
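
The arithmetic behind the overrun can be sketched as below. This is a simplified model, not the code in model_runner.py: the helper blocks_touched, the block size of 256, the 600-token sequence, and the assumption that the scheduler computes num_scheduled_tokens as num_tokens - num_cached_tokens are all illustrative.

```python
from math import ceil

BLOCK_SIZE = 256  # assumed block size, for illustration only


def blocks_touched(num_tokens, cached_at_schedule, cached_after_allocate):
    """Hypothetical helper: return (end, end_block, num_allocated_blocks)."""
    # Scheduler side (assumed formula): uses the num_cached_tokens value
    # it sees at scheduling time.
    num_scheduled_tokens = num_tokens - cached_at_schedule
    # prepare_prefill side: start reflects num_cached_tokens after allocate().
    start = cached_after_allocate
    seqlen_q = num_scheduled_tokens
    end = start + seqlen_q
    end_block = ceil(end / BLOCK_SIZE)
    # Blocks physically present in seq.block_table.
    num_allocated_blocks = ceil(num_tokens / BLOCK_SIZE)
    return end, end_block, num_allocated_blocks


# Stale value: the scheduler saw 0 cached tokens, then allocate() found 256.
stale = blocks_touched(600, cached_at_schedule=0, cached_after_allocate=256)
# Fresh value: the scheduler saw the same 256 cached tokens as prepare_prefill.
fresh = blocks_touched(600, cached_at_schedule=256, cached_after_allocate=256)

print(stale)  # (856, 4, 3): end_block exceeds the 3 allocated blocks
print(fresh)  # (600, 3, 3): end_block stays within the allocation
```

In the stale case, end_block is 4 while block_table holds only 3 entries, so indexing block_table[3] is out of range. Once both sides see the same num_cached_tokens, end equals the sequence length.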