The RTX 30 Ampere and RTX 40 Ada series GPUs additionally have support to perform asynchronous transfers between global and shared memory. The H100 Hopper GPU extends this further by introducing the Tensor Memory Accelerator (TMA) unit. the TMA unit combines asynchronous copies and index calculation...