48 Commits

Author SHA1 Message Date
Rain-Bus ffd2defdfc add Chinese annotations to all source files for learning purposes
Annotated 16 source files covering the full architecture:
engine (scheduler, block manager, model runner), layers (attention,
linear, sampler, etc.), model (qwen3), and utils.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 21:33:15 +08:00
GeekExplorer 9fa256a56d fix cache hit 2026-04-26 03:49:14 +08:00
GeekExplorer f64d821c20 fix chunked prefill bugs and refactor 2026-04-26 02:53:06 +08:00
Xingkai Yu 44a51afc8a Merge pull request #207 from DestineG/fix-prefill-index-out-of-range
fix(scheduler): recalculate num_tokens after allocate to prevent IndexError at model_runner:155
2026-04-25 18:12:43 +08:00
Tai An 25794a1f29 fix(model_runner): correct seqlen_k to chunk boundary in prepare_prefill
During chunked prefill, seqlen_k was set to len(seq) (the full sequence
length), causing the attention kernel to access uninitialized KV slots
for tokens not yet scheduled in the current chunk.

Fix: compute end = start + seqlen_q first, then set seqlen_k = end,
which limits attention to the current chunk boundary.

Fixes #212
2026-04-22 15:13:19 -07:00
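The seqlen_k fix described in 25794a1f29 can be illustrated with a small standalone sketch. It is not the code from model_runner.py: the function name `prefill_chunk_lengths` and its parameters are hypothetical, and the chunk size is assumed to be a fixed token count.

```python
def prefill_chunk_lengths(seq_len: int, start: int, chunk_size: int) -> tuple[int, int, int]:
    """Return (seqlen_q, seqlen_k, end) for the prefill chunk that begins at `start`.

    Hypothetical helper for illustration; not the repository's prepare_prefill.
    """
    # Query length: tokens scheduled in this chunk, clipped to what is left.
    seqlen_q = min(chunk_size, seq_len - start)
    # Compute the end of the chunk first ...
    end = start + seqlen_q
    # ... then let keys span only [0, end) instead of the full sequence length,
    # so attention never reads KV slots that have not been written yet.
    seqlen_k = end
    return seqlen_q, seqlen_k, end


# A 10-token prompt prefilled in chunks of 4 tokens.
chunks = [prefill_chunk_lengths(10, start, 4) for start in (0, 4, 8)]
print(chunks)  # [(4, 4, 4), (4, 8, 8), (2, 10, 10)]
```

With seqlen_k set to the full length (10) for every chunk, the first chunk would attend over 6 key positions whose KV entries do not exist yet.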
six 77dd709ca1 fix(scheduler): recalculate num_tokens after allocate to prevent IndexError
The scheduler overestimated num_scheduled_tokens because it used an outdated
num_cached_tokens before block_manager.allocate(seq) could update it via
prefix cache hits. In prepare_prefill (model_runner.py), this caused
'end = start + seqlen_q' to exceed the sequence length, leading to an
inflated 'end_block'. Consequently, an 'index out of range' error occurred
at line 155 when accessing seq.block_table[i] beyond its actual physical
allocation.
2026-04-20 16:34:27 +08:00
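The ordering problem described in 77dd709ca1 can be reproduced with a toy model. Everything here is an assumption made for illustration: `ToySeq`, `toy_allocate`, and `schedule_prefill` are hypothetical stand-ins, and the real scheduler and block manager carry far more state.

```python
from dataclasses import dataclass


@dataclass
class ToySeq:
    """Hypothetical stand-in for a sequence: only the two counters that matter here."""
    num_tokens: int
    num_cached_tokens: int = 0


def toy_allocate(seq: ToySeq, prefix_hit_tokens: int) -> None:
    """Stand-in for block allocation: a prefix cache hit raises num_cached_tokens."""
    seq.num_cached_tokens = prefix_hit_tokens


def schedule_prefill(seq: ToySeq, token_budget: int, prefix_hit_tokens: int) -> int:
    """Allocate first, then derive the scheduled token count from the updated counter."""
    toy_allocate(seq, prefix_hit_tokens)
    remaining = seq.num_tokens - seq.num_cached_tokens
    return min(token_budget, remaining)


seq = ToySeq(num_tokens=10)
# Stale estimate: computed before allocation, while num_cached_tokens is still 0.
stale = min(16, seq.num_tokens - seq.num_cached_tokens)
# Fresh estimate: computed after allocation reports an 8-token prefix hit.
fresh = schedule_prefill(seq, token_budget=16, prefix_hit_tokens=8)

print(stale, fresh)                   # 10 2
print(seq.num_cached_tokens + stale)  # 18, past the end of a 10-token sequence
print(seq.num_cached_tokens + fresh)  # 10, exactly the sequence length
```

In this toy setup the stale count yields end = 18 for a 10-token sequence, which is the kind of overshoot the commit message ties to the out-of-range block_table access.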
GeekExplorer 8d63a98c03 support chunked prefill and fix minor bug 2026-04-14 03:05:35 +08:00
GeekExplorer 9e8507ef41 minor simplify 2026-04-13 22:09:46 +08:00
Xingkai Yu 02a95fdc66 Merge pull request #203 from Anai-Guo/fix-row-parallel-bias-crash
fix RowParallelLinear weight_loader crash when bias is enabled
2026-04-13 21:26:07 +08:00
Xingkai Yu a4f94cb38b Merge pull request #200 from KinglittleQ/fix-scheduler-typing
Fix scheduler.postprocess return type
2026-04-13 21:13:37 +08:00
Xingkai Yu 00eea73176 Merge pull request #172 from IceCreamMilkyTea/main
enable 'slots=True' for dataclasses
2026-04-13 20:45:50 +08:00
Xingkai Yu 52d2215911 Merge pull request #148 from guodongxiaren/main
remove hard code for block_size
2026-04-13 20:36:32 +08:00
Anai-Guo bf99453d90 fix RowParallelLinear weight_loader crash when bias is enabled
When RowParallelLinear has bias=True, the weight_loader crashes with an
IndexError because it calls param_data.size(tp_dim) where tp_dim=1, but
the bias tensor is 1D and only has dimension 0.

The bias in RowParallelLinear is not sharded (all ranks hold the full
bias, only rank 0 applies it), so skip the sharding logic for 1D params.

Fixes GeeeekExplorer/nano-vllm#125
2026-04-12 23:10:52 -07:00
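The guard described in bf99453d90 can be sketched without tensors by working on shapes alone. This is a simplified illustration, not the repository's weight_loader: `row_parallel_shard` is a hypothetical helper, and a real loader would apply the returned range to the checkpoint tensor instead of returning it.

```python
from typing import Optional


def row_parallel_shard(param_shape: tuple[int, ...], tp_dim: int, tp_rank: int) -> Optional[tuple[int, int]]:
    """Return (start, size) of this rank's slice along tp_dim, or None to copy the whole tensor.

    Hypothetical helper: param_shape is the shape of the local (already sharded) parameter.
    """
    # A 1-D parameter (the bias) has no dimension 1, so indexing its size at
    # tp_dim=1 would raise IndexError. The bias is not sharded: load it whole.
    if len(param_shape) == 1:
        return None
    shard_size = param_shape[tp_dim]
    return tp_rank * shard_size, shard_size


# Weight of local shape (out=8, in=4) on rank 2: take input columns [8, 12).
print(row_parallel_shard((8, 4), tp_dim=1, tp_rank=2))  # (8, 4)
# Bias of shape (8,): no sharding, copy the full tensor.
print(row_parallel_shard((8,), tp_dim=1, tp_rank=2))    # None
```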
Chengqi Deng 498f5a1aa8 Fix scheduler.postprocess return type 2026-04-11 13:23:49 +08:00
IceCreamMilkyTea f438ce463f enable slots=True for dataclasses 2026-02-08 23:46:01 -05:00
guodongxiaren 55c64e7fdf remove hard code for block_size 2025-12-30 01:55:17 +08:00
Mengqi 82f5ca244f fix bug for tp 2025-12-18 01:28:25 +08:00
GeeeekExplorer 2f21442653 support qwen2 2025-11-04 01:44:42 +08:00
GeeeekExplorer 6ef2a4f630 compile random sampling 2025-08-31 22:55:34 +08:00
GeeeekExplorer df99418f7d simplify 2025-08-31 20:02:51 +08:00
PeterDing f5b4840276 fix(model_runner): correct position indexing to be 0-based
- Change position calculation from len(seq) to len(seq) - 1
2025-07-04 14:29:12 +08:00
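The off-by-one in f5b4840276 comes down to the index of the most recent token. A minimal sketch, assuming the position is needed for the last token of the sequence (`last_token_position` is a hypothetical name):

```python
def last_token_position(token_ids: list[int]) -> int:
    """0-based position of the most recent token in the sequence."""
    # Positions start at 0, so the last of n tokens sits at index n - 1, not n.
    return len(token_ids) - 1


print(last_token_position([101, 102, 103]))  # 2
```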
GeeeekExplorer 38baf0bbe4 remove assert shape 2025-06-27 23:00:30 +08:00
GeeeekExplorer cb0b3dec3f remove rng state 2025-06-27 22:50:33 +08:00
GeeeekExplorer 1caeec8dfa same as vllm 2025-06-27 18:50:56 +08:00
GeeeekExplorer 658520b788 warmup and allocate 2025-06-27 01:51:57 +08:00
xiaohajiayou 054aec852d Fix: Division-by-Zero Risk and Typo 2025-06-24 02:02:33 +08:00
GeeeekExplorer 03cfc13bb3 faster pickle 2025-06-23 00:51:52 +08:00
GeeeekExplorer cde3fc22c2 simplify 2025-06-21 17:19:15 +08:00
jinghuan-Chen ffafaeb133 Release CUDA Graphs resource before exit. 2025-06-18 16:17:31 +08:00
Xingkai Yu 4fc764f175 Merge pull request #22 from cheunglei/use_spawn 2025-06-17 23:53:59 +08:00
cheunglei b5ace32982 use spawn 2025-06-17 23:49:15 +08:00
GeeeekExplorer bc0ad5a116 better 2025-06-17 23:33:38 +08:00
GeeeekExplorer 7e42fa6f63 fix 2025-06-15 13:28:29 +08:00
Xingkai Yu 326b121fad Merge pull request #10 from MARD1NO/refine_return_hint_in_schedule 2025-06-15 10:39:51 +08:00
GeeeekExplorer fc778a4da9 better 2025-06-15 10:36:45 +08:00
MARD1NO 98bbbefb68 schedule return bool args 2025-06-15 10:15:05 +08:00
cheunglei 53b3ef2e32 support tensor parallel 2025-06-15 01:31:24 +08:00
GeeeekExplorer b6136383c9 support fast pickle 2025-06-14 13:36:57 +08:00
GeeeekExplorer 4a8aa090a7 fix 2025-06-14 00:56:07 +08:00
GeeeekExplorer 59aa3ff57c better 2025-06-13 13:07:33 +08:00
GeeeekExplorer 98a1551a7d support CUDA_VISIBLE_DEVICES 2025-06-12 23:14:01 +08:00
GeeeekExplorer ec3c60d96f update bench 2025-06-12 22:54:51 +08:00
GeeeekExplorer f16adb729e refactor 2025-06-12 09:41:12 +08:00
GeeeekExplorer fee58d44e4 fix 2025-06-12 01:00:31 +08:00
GeeeekExplorer 08c84ec08d multi file loader 2025-06-12 01:00:09 +08:00
GeeeekExplorer 386290d69e refactor 2025-06-11 21:12:57 +08:00
GeeeekExplorer b98e1ca305 fix 2025-06-10 21:25:54 +08:00
GeeeekExplorer a5a4909e6a init commit 2025-06-10 00:27:01 +08:00