The training speed is the same, but we will have to see how it behaves with a larger max-duration and a larger model size.

marcoyang1998 (Collaborator, Author) commented on Aug 2, 2024:

Experiments on full LibriSpeech. The model is trained for 30 epochs using 4 A100 GPUs. WERs are obtained using modified_...
Because memory capacity is not the only relevant metric? I see. The RTX 4090 is based on the Ada Lovelace architecture, which is newer than the Ampere architecture, which in turn is newer than the Turing architecture that the Quadro RTX 6000 uses. I understand. ("e.g." is "for example"...)
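For reference, a small sketch (assuming PyTorch is installed and a CUDA device is visible) that reads a GPU's compute capability and maps the major version to the architecture families mentioned above; the mapping table is illustrative, not exhaustive:

```python
import torch

# CUDA compute capability (major version) -> architecture family.
# Turing (Quadro RTX 6000) is 7.5, Ampere (A100) is 8.0,
# Ada Lovelace (RTX 4090) is 8.9, Hopper (H100) is 9.0.
ARCH_BY_MAJOR = {7: "Volta/Turing", 8: "Ampere/Ada Lovelace", 9: "Hopper"}

major, minor = torch.cuda.get_device_capability()
print(f"compute capability {major}.{minor} -> {ARCH_BY_MAJOR.get(major, 'unknown')}")
```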
```python
... (need Hopper TE support) is currently only supported for compute capability >= 80")
else:
    # TODO (yiakwy): add FP8 support
    raise NotImplementedError
output = flash_attn_func(q, k, v, dropout_p=self.dropout.p, causal=is_causal)
output = revert_mold_flash_attn_input(output)
if output_...
```
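For context, here is a minimal, self-contained sketch of this dispatch pattern: gate `flash_attn_func` behind a compute-capability check and fall back to PyTorch's `scaled_dot_product_attention` otherwise. The `attention` wrapper and its fallback path are illustrative assumptions, not the PR's actual code (which uses its own `revert_mold_flash_attn_input` helper):

```python
import torch

try:
    from flash_attn import flash_attn_func  # flash-attn package, if available
    _HAS_FLASH = True
except ImportError:
    _HAS_FLASH = False


def attention(q, k, v, dropout_p=0.0, is_causal=False):
    """Hypothetical dispatcher: FlashAttention on sm80+ GPUs, SDPA otherwise.

    Expects q/k/v shaped (batch, seqlen, num_heads, head_dim), the layout
    flash_attn_func takes.
    """
    major, _ = torch.cuda.get_device_capability(q.device)
    if _HAS_FLASH and major >= 8:
        # flash_attn_func requires fp16/bf16 inputs and compute capability >= 8.0.
        return flash_attn_func(q, k, v, dropout_p=dropout_p, causal=is_causal)
    # Fallback: SDPA expects (batch, num_heads, seqlen, head_dim), so transpose.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = torch.nn.functional.scaled_dot_product_attention(
        q, k, v, dropout_p=dropout_p, is_causal=is_causal
    )
    return out.transpose(1, 2)
```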