I'll upstream that kernel in the next PR)

## Checklist

- [x] Make FA3 template compatible with deepseek model shape
- [x] Make FA2 template compatible with deepseek model shape
- [x] Fix AOT compilation scripts
- [x] Fix C++ tests/benchmarks

## Changes to the programming interface...