triton.Config({'BLOCK_SIZE_M':32,'BLOCK_SIZE_N':64,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=5,num_warps=2),# Good configforfp8 inputs.triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':256,'BLOCK_SIZE_K':128,'GROUP_SIZE_M':8},num_stages=3,num_warps=8),triton.C...
classtriton.Config(self,kwargs,num_warps=4,num_stages=2,num_ctas=1,maxnreg=None,pre_hook=None) 表示自动调优可能尝试的内核配置的对象 变量: kwargs– 1 个元参数字典,用于作为关键字参数传递给内核。 num_warps– 在为 GPU 编译时内核使用的线程数。例如,如果 num_warps=8,则每个内核实例将自动并行...
首先我们来看一下 triton 官方的 GEMM 的 kernel 代码: defget_autotune_config():return[triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':256,'BLOCK_SIZE_K':64,'GROUP_SIZE_M':8},num_stages=3,num_warps=8),triton.Config({'BLOCK_SIZE_M':64,'BLOCK_SIZE_N':256,'BLOCK_SIZE_K':32,...
triton.Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8), triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=4, num_warps=4), ... ...
"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 8, "num_stages": 4, "num_warps": 8 }, torch.float16: { "BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8, "num_stages": 3, ...
# 如果 `num_pid_m` 不能被 `GROUP_SIZE_M` 整除,最后一组会比较小 group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M) # *Within groups*, programs are ordered in a column-major order # 在组内,程序按列主序排序。
Config({'BLOCK_SIZE_M': 128, 'BLOCK_SIZE_N': 128, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 8}, num_stages=3, num_warps=8), key=['M', 'N', 'K'], # 这个值的变化会带来 调优配置变化 ) @triton.jit def matmul_kernel( a_ptr, b_ptr, c_ptr, # 矩阵指针 a(M, K) ...
triton.Config({'BLOCK_SIZE_M':32,'BLOCK_SIZE_N':64,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=5,num_warps=2), ], key=['M','N','K'], ) img 当我们去调整对应的调优空间 @triton.autotune( configs=[ triton.Config({'BLOCK_SIZE_M':32,'BLOCK_SIZE_N':64,'BLOCK_SIZE_K...
Config({'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8}, num_stages=5, num_warps=2), ], key=['M', 'N', 'K'], # 自动调优关键字 ) @triton.jit def matmul_kernel( # 指向矩阵的指针 a_ptr, b_ptr, c_ptr, # 矩阵维度 M, N, K, ...
triton.Config({'BLOCK_SIZE_M':32,'BLOCK_SIZE_N':64,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=5,num_warps=2), ], key=['M','N','K'], ) 然后通过调用Triton的do_bench就可以将你写的算子跑起来了,do_bench处在python/triton/testing.py下,其中会对每个kernel进行25次的warm_up...