Today I want to introduce a kernel optimization in SGLang for the biased_grouped_topk function used by the DeepSeek V3 model (https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/layers/moe/topk.py#L99-L149), which improves end-to-end throughput by more than 5% in DeepSeek V3 benchmarks. This function lives in the MoE layer of the DeepSeek V3/R1 models and computes each token's expert selection probabilities.
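For context, the routing this kernel implements is the DeepSeek-V3 style grouped top-k with a per-expert selection bias: the bias is added to the sigmoid scores only for choosing experts, while the returned weights come from the unbiased scores. Below is a minimal PyTorch sketch of that scheme, assuming it follows the DeepSeek-V3 paper's description; the function name, signature, and defaults are illustrative and are not copied from the SGLang source.

```python
import torch

def biased_grouped_topk_ref(gating_output: torch.Tensor,
                            correction_bias: torch.Tensor,
                            topk: int,
                            num_expert_group: int,
                            topk_group: int,
                            renormalize: bool = True):
    """Naive reference of DeepSeek-V3 style biased grouped top-k routing (illustrative).

    gating_output:   [num_tokens, num_experts] router logits
    correction_bias: [num_experts] per-expert bias used only for expert selection
    """
    scores = gating_output.sigmoid()                    # [T, E]
    scores_for_choice = scores + correction_bias        # bias affects selection only

    num_tokens, num_experts = scores.shape
    # Score each group by the sum of its top-2 biased expert scores, then keep topk_group groups.
    group_scores = (
        scores_for_choice.view(num_tokens, num_expert_group, -1)
        .topk(2, dim=-1).values.sum(dim=-1)             # [T, num_groups]
    )
    group_idx = group_scores.topk(topk_group, dim=-1).indices

    # Mask out experts belonging to non-selected groups before the final top-k.
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, group_idx, 1.0)
    score_mask = group_mask.unsqueeze(-1).expand(
        num_tokens, num_expert_group, num_experts // num_expert_group
    ).reshape(num_tokens, num_experts)
    masked_scores = scores_for_choice.masked_fill(score_mask == 0, float("-inf"))

    topk_ids = masked_scores.topk(topk, dim=-1).indices
    topk_weights = scores.gather(1, topk_ids)           # weights use the unbiased scores
    if renormalize:
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```

The returned topk_weights and topk_ids are what the downstream fused MoE kernels consume; the optimization discussed here targets how this selection itself is computed.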