value cuda:0 torch.float16 False torch.Size([1, 32, 1238, 128])
---
max absdiff softmax tensor(0.000488, device='cuda:0', dtype=torch.float16)
median absdiff softmax tensor(0., device='cuda:0', dtype=torch.float16)
---
SDPA max absdiff (no cast): tensor(0., device='cuda:0...
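
A note on the softmax max absdiff of 0.000488 printed above: it matches 2^-11 (0.00048828125) to the printed precision, and 2^-11 is the spacing between adjacent float16 values in [0.5, 1). The difference is therefore consistent with a single last-bit rounding difference on one large softmax probability, although the log alone does not prove that. The sketch below checks the spacing using only the Python standard library. The helper names `to_fp16` and `fp16_ulp` are hypothetical and are not taken from the original script.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest float16 and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_ulp(x: float) -> float:
    """Gap between x (rounded to float16) and the next larger float16.

    Assumes x is positive and finite, well below the float16 maximum,
    so that incrementing the bit pattern yields the next representable value.
    """
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    next_value = struct.unpack("<e", struct.pack("<H", bits + 1))[0]
    return next_value - to_fp16(x)


# Spacing of float16 values in [0.5, 1), evaluated at an arbitrary point in that range.
ulp = fp16_ulp(0.75)
print(ulp)            # 0.00048828125
print(ulp == 2**-11)  # True
```

If this reading is right, a median absdiff of 0 alongside a max absdiff of one such step would mean that most elements agree bit for bit and only a few differ in their last bit.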